hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-06-24 10:52:21 +00:00

Author	SHA1	Message	Date
teknium1	4d4ba0831e	refactor(session): simplify traversal guard to a helper + logger, harden non-leading separators Follow-up to the salvaged #9560 fix: - Replace the _TRAVERSAL_RE regex with an explicit _is_path_unsafe() helper (drops the now-unused `import re`); catches a path separator ANYWHERE, not just leading, so a non-leading Windows backslash can't slip through. - Switch the per-entry skip in _ensure_loaded_locked from print() to logger.warning to match the module's logging conventions. - Add AUTHOR_MAP entry for the contributor. - Add regression tests for the non-leading-separator case.	2026-06-21 15:23:36 -07:00
OrbisAI Security	aa2aac68b0	fix(V-009): reject Windows drive-letter paths in session field validation Extends the CWE-22 path traversal guard to cover Windows absolute paths of the form C:/... and D:\... — previously only leading / and \ were checked, which missed drive-letter prefixes. Replaces the inline startswith check with a compiled module-level regex (_TRAVERSAL_RE) that covers all three attack patterns: .., leading /\, and leading X: drives. Adds two regression tests for C:/windows/system32 and D:\\path\\to\\file. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-21 15:23:36 -07:00
OrbisAI Security	3a6a43cb81	fix(V-009): reject path traversal in SessionEntry.from_dict and harden _ensure_loaded Addresses PR #9560 review comments: applies the CWE-22 fix to current main (post-PR #458 rebase) and adds the requested regression tests. - SessionEntry.from_dict now raises ValueError for session_key or session_id containing '..' or starting with '/' or '\' (directory traversal guard) - SessionStore._ensure_loaded moves per-entry validation inside the loop so one malicious/corrupt entry is skipped with a warning instead of aborting the entire sessions.json load - Adds TestSessionEntryFromDictTraversalValidation (5 cases) and TestEnsureLoadedSkipsInvalidEntries covering the skip-not-abort behavior Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-21 15:23:36 -07:00
orbisai0security	c8eb7cf843	fix: V-009 security vulnerability Automated security fix generated by Orbis Security AI	2026-06-21 15:23:36 -07:00
ethernet	bb59075b25	Merge pull request #50398 from helix4u/fix/windows-npm-path-fallback fix(windows): prefer cmd npm shim on PATH fallback	2026-06-21 18:55:02 -03:00
devorun	6f0ecf37da	fix(redact): mask all Authorization schemes and x-api-key style headers Secret redaction only matched `Authorization: Bearer <token>`. Other auth headers passed through verbatim into logs, tool output, and transcripts: - `Authorization: Basic <base64>` — leaks base64(user:password) - `Authorization: token <pat>` / any non-Bearer scheme - `Proxy-Authorization: ...` - `x-api-key: <key>` (Anthropic and many providers) and `api-key`, `x-goog-api-key`, `x-auth-token`, `x-access-token`, ... — opaque values with no known vendor prefix were caught by nothing A logged request or an echoed `curl -H "x-api-key: ..."` command therefore leaked live credentials. Generalize the Authorization rule to mask the credential for any scheme (and Proxy-Authorization) while preserving the header name and scheme word for debuggability, and add an api-key header rule for the single-opaque-value headers. Bearer behavior is unchanged; plain prose containing the word "authorization" (no colon-delimited value) is left untouched. Adds regression tests for Basic/token/Proxy auth and the x-api-key/api-key headers, including inside a curl command.	2026-06-21 14:08:06 -07:00
teknium1	87ab373381	test(url-safety): cover IPv6 scope-ID strip + fail-closed in URL guards Follow-up to the salvaged #25961 fix: regression tests asserting that scope-bearing IPv6 addresses (fe80::1%eth0, ::1%lo) are blocked by is_safe_url after the scope is stripped, that a still-unparseable address fails closed, and that a scoped IPv4-mapped IMDS address is caught by the always-blocked floor.	2026-06-21 13:56:35 -07:00
sprmn24	ed966696eb	fix(security): handle IPv6 scope IDs in URL safety checks to prevent bypass ipaddress.ip_address() raises ValueError on IPv6 addresses with scope IDs (e.g. 'fe80::1%eth0'). Both is_always_blocked_url() and is_safe_url() silently skipped these via `except ValueError: continue`. If ALL resolved addresses for a hostname carry scope IDs, every address is skipped and the URL passes all safety checks — a potential SSRF bypass vector against link-local or metadata endpoints. Fix: - Strip the scope ID (%eth0) before parsing in both functions - is_safe_url(): fail closed (return False) with a warning log if still unparseable after stripping - is_always_blocked_url(): use continue (not return False) to preserve multi-address scanning, with a warning log Affected: tools/url_safety.py — is_always_blocked_url(), is_safe_url()	2026-06-21 13:56:35 -07:00
liuhao1024	b5b8a4cd56	fix(gateway): respect adapter decline of fresh-final to prevent double delivery When a streamed Telegram reply finalizes, the stream consumer could take the fresh-final path (send a new sendRichMessage + best-effort delete the preview) purely because the time-based _should_send_fresh_final() threshold elapsed — even though Telegram's prefers_fresh_final_streaming returns False. The fresh Rich Message then overlapped the legacy MarkdownV2 preview already on screen, leaving both visible (the #47048 table + bullet double-render). Honor the adapter's decision: when prefers_fresh_final_streaming exists on the adapter (checked on the class + instance __dict__ so MagicMock auto-attrs don't false-positive) and declines, the time threshold no longer overrides it. Adapters without the hook keep the time-based fresh-final for backward compat. Fixes #47048	2026-06-21 13:55:50 -07:00
teknium1	f79e0a7060	fix(email): mark missing-config as non-retryable + reject blank env vars (#40715) Fold in the #40715 blank-env OOM fix on top of the host-resolution change: - connect() now sets a non-retryable fatal error when required settings are missing, so the gateway stops reconnecting against an empty host instead of looping forever and leaking memory until the host OOM-kills. - check_email_requirements() treats blank/whitespace-only EMAIL_* values as missing, so an abandoned setup with empty keys no longer enables the platform. Credits the parallel fixes by zerone0x (#40745) and liuhao1024 (#40829).	2026-06-21 13:33:52 -07:00
teknium1	e921c4f826	chore(release): map devorun salvage author email	2026-06-21 13:33:52 -07:00
devorun	b7f6cb9c8b	fix(email): resolve IMAP/SMTP host from config and validate before connecting The email adapter read address/host purely from env vars and never stripped them, so a missing or whitespace-padded EMAIL_IMAP_HOST reached imaplib.IMAP4_SSL("") and surfaced as the misleading "[Errno 8] nodename nor servname provided, or not known" — sending users down a DNS rabbit hole when the real problem was an empty/dirty host string. A config.yaml-only setup also left the host empty because __init__ ignored PlatformConfig.extra, even though the "connected" check, the send helper, and `hermes config show` already read address/imap_host/smtp_host from it. Resolve address/imap_host/smtp_host from the env var first, then fall back to config.extra, and strip surrounding whitespace — matching the send helper's existing pattern. Validate the required settings at the start of connect() and return False with an actionable message instead of attempting a connection with an empty host. Adds regression tests for whitespace stripping, config.extra fallback, and the no-IMAP-attempt-on-missing-host path.	2026-06-21 13:33:52 -07:00
teknium1	4cff0360ea	test(approval): regression for interrupt-unblocks-approval; AUTHOR_MAP - Add thread-scoped regression test: interrupt on the waiting thread resolves the approval as deny well under the 300s timeout; a foreign-thread interrupt does NOT release the wait (interrupts are per-thread). - Add panghuer023 to AUTHOR_MAP for the salvaged #37994 fix.	2026-06-21 13:33:48 -07:00
panghuer023	a9c8025984	fix(approval): honor interrupt in blocking gateway approval wait (#8697) A dangerous-command gateway approval blocks the agent's execution thread inside _await_gateway_decision() on threading.Event.wait() until the user responds or the 5-minute approval timeout fires. The poll loop never checked is_interrupted(), so /stop (which flags the agent's execution thread via AIAgent.interrupt()) was silently ignored — the session stayed wedged until timeout, even though /stop reported the session unlocked. Check is_interrupted() at the top of the poll loop. The wait runs on the agent's execution thread, the exact thread interrupt() flags, so the check sees the signal and resolves the pending approval as deny — the agent loop receives a normal denial and unwinds cleanly. Covers /stop, /new, and the gateway inactivity-timeout interrupt through the single shared wait loop used by both the terminal and execute_code guards.	2026-06-21 13:33:48 -07:00
Teknium	824c9d3812	fix(config): alias model.api_base -> model.base_url for custom providers (#50385) A bare custom provider configured via `model.api_base` (the intuitive name OpenAI-SDK / LiteLLM users reach for) was silently ignored: `hermes config set` accepts any dotted key, so `model.api_base` got written and confirmed, but the runtime resolver reads only `model.base_url`. Requests fell back to OpenRouter with an empty key -> 401, zero hits to the custom endpoint (issue #8919). Now api_base is migrated to base_url at load time (fixes existing broken configs) and at set time (with a notice), never overriding an explicit base_url. Closes #8919.	2026-06-21 13:33:41 -07:00
Teknium	bb77a8b0d5	fix(gateway): respawn unmapped Windows gateways after update (#50090) (#50373) On Windows, _pause_windows_gateways_for_update() force-kills every running gateway before mutating the venv. Gateways mapped to a profile (via profile.path/gateway.pid) were respawned afterward, but gateways with NO profile mapping — e.g. a Windows Scheduled Task running "pythonw.exe -m hermes_cli.main gateway run" — were force-killed and only told to restart manually. After an auto-update/bootstrap the Telegram bot stayed dead until manual intervention. Now we snapshot each unmapped gateway's argv (psutil, guarded by looks_like_gateway_command_line) before the kill and replay it through the same detached watcher used for profile gateways, so unmapped gateways come back automatically too. Co-authored-by: Hermes Agent <agent@nousresearch.com>	2026-06-21 13:33:26 -07:00
Teknium	99f3072aa0	fix(model-switch): a failed in-place swap must be a no-op, not a dead session (#50375 ) When a /model switch resolves a valid model but the in-place agent swap fails mid-conversation (expired key, unreachable base_url), the agent rolls itself back to the old working model+client and re-raises. The callers caught that re-raise, logged a warning, then committed the broken switch anyway: wrote the failed model to the session DB, set _session_model_overrides to the broken model/provider/key, and (gateway direct path) evicted the working cached agent. The next message then rebuilt a dead agent from the broken override -> permanently unusable conversation (#50163). Fix the whole caller class so a failed swap aborts the commit entirely: - gateway/slash_commands.py (picker + direct /model paths): on swap failure, early-return an error message; skip DB persist, session override, cache eviction, and config write. - cli.py (both /model handlers): snapshot CLI-level credential/runtime fields before mutating, restore them on swap failure, and abort the note + success print. - tui_gateway/server.py: wrap the previously-unguarded swap; on failure raise a clean error and skip worker restart, runtime persist, switch marker, session model_override, and config persist. The no-cached-agent path (apply-on-next-session) is unaffected. Adds a gateway regression test that fails on the pre-fix behavior.	2026-06-21 13:33:23 -07:00
memosr	ed3d12a762	fix(security): fail-closed when WebSocket peer is empty in loopback mode Per @egilewski's audit on this PR (#15544), the original fix was correct but the file has refactored since: the four endpoint-local empty-peer checks have been consolidated into _ws_client_is_allowed and _ws_client_reason, but the helpers were left fail-open ('no peer host known means allow' / 'no reason to block'). On a loopback-bound dashboard with auth disabled, an ASGI server behind a misconfigured proxy or a unix-socket transport can deliver ws.client == None or ws.client.host == ''. The helpers were treating that as 'allowed', so the loopback-only peer gate could be bypassed by anything that suppressed the client tuple in transit. All four WebSocket endpoints (/api/pty, /api/ws, /api/pub, /api/events) route through _ws_request_is_allowed -> _ws_client_is_allowed, so the gap applied uniformly. Fix: * _ws_client_is_allowed: return False when client_host is empty instead of True. Only reached on loopback bind with auth disabled (auth_required=True and explicit non-loopback binds short-circuit earlier), so the fail-closed behavior is scoped to the surface that needs it. * _ws_client_reason: return a 'missing_or_empty_peer bound=...' block reason instead of None, so the dispatcher's existing reason-based rejection path picks it up and the close gets logged with a machine-parseable token for diagnosability. Behavior unchanged for: * gated mode (auth_required=True) — early-returns True before the empty-peer check runs. The OAuth ticket is the auth at that point. * explicit non-loopback bind (--host 0.0.0.0/::, or a specific LAN address, always with --insecure) — early-returns True before the empty-peer check runs. DNS-rebinding is still blocked by the Host/Origin guard in _ws_host_origin_is_allowed. * legitimate loopback peers (client_host == '127.0.0.1' / '::1') — not affected by the empty-peer branch. Regression tests added in tests/hermes_cli/test_dashboard_auth_ws_auth.py: * test_empty_client_host_rejected_in_loopback_mode * test_missing_client_object_rejected_in_loopback_mode * test_empty_client_host_reason_is_block Plus two regression guards to ensure the fix does not over-reach: * test_empty_client_host_still_allowed_in_insecure_public_mode * test_empty_client_host_still_allowed_in_gated_mode All three new fail-closed tests fail without this patch (the helpers return True / None for an empty peer) and pass with it. The 45 pre-existing tests in test_dashboard_auth_ws_auth.py continue to pass.	2026-06-21 13:33:18 -07:00
sgaofen	a4b1554c73	fix(whatsapp): normalize bare phone targets to JIDs before bridge send Baileys' jidDecode crashes ("Cannot destructure property 'user' of jidDecode(...) as it is undefined") when handed a bare phone number, so sending a WhatsApp message to +50766715226 / 50766715226 returned HTTP 500 and never delivered (#8637). Add to_whatsapp_jid() to gateway/whatsapp_identity.py — the outbound inverse of normalize_whatsapp_identifier: it builds the JID a send must use (bare phone -> <digits>@s.whatsapp.net) and passes through already qualified JIDs (@g.us, @lid, status@broadcast, @newsletter) unchanged. Wire it at every outbound bridge call site in the WhatsApp adapter (send, edit, media, typing, get_chat_info, and the standalone cron / send_message sender). Co-authored-by: Hermes Agent <noreply@nousresearch.com>	2026-06-21 13:32:22 -07:00
Teknium	f72690825e	fix(desktop/windows): stop in-app update from cascading into a backend restart loop (#50381 ) When a Windows user relaunches Hermes while an in-app update is still running (the desktop vanished with no progress and looks crashed), the fresh instance spawns its own dashboard backend. That backend re-locks the venv shim, the updater's straggler cleanup (force_kill_other_hermes -> taskkill /F /T /IM hermes.exe) kills it, the launch dies with the 45s "backend didn't come up" timeout, and the user relaunches into the same trap -- an infinite respawn/kill loop (#50238). Root cause: no mutual exclusion between an applying update and a fresh desktop spawning its own local backend. Fix: the updater publishes a HERMES_HOME/.hermes-update-in-progress marker (pid + start time) for the whole run via an RAII drop-guard that removes it on every exit path (success, early return, panic). A freshly-launched desktop checks the marker before spawning its local backend and PARKS until the update finishes -- then brings the backend up itself (it is the surviving instance; the updater's own relaunch hits the single-instance lock and quits). A stale marker (dead pid or past a 20-minute ceiling) is pruned so a crashed updater can never strand future launches. No rogue backend spawns mid-update, so force_kill_other_hermes has nothing legitimate to kill. Marker parse/staleness logic is extracted to update-marker.cjs and unit-tested; the Rust guard has unit tests; the Rust-write <-> JS-read contract is E2E-verified.	2026-06-21 13:10:32 -07:00
LeonSGP43	09a96ba0f6	fix(gateway): pause Telegram typing before stream finalize In Telegram streaming, the typing indicator persisted through the slow final rich-text/MarkdownV2 finalize edit, so the '...typing' bubble lingered for seconds after the last streamed token. Add a one-shot on_before_finalize hook to GatewayStreamConsumer, fired once when the stream transitions into its finalization path, and wire it on both Telegram streaming call sites to call pause_typing_for_chat() before the final edit. Cover hook ordering and once-only behavior in tests. Fixes #49712	2026-06-21 13:10:25 -07:00
teknium1	6902eb3913	fix(cli): make ZIP-update directory replace atomic so it can't delete ui-tui Root cause of #49145: the Windows ZIP-update path did rmtree(dst) then copytree(src, dst). If the copy failed partway — common on that path, which only runs because file I/O is already flaky on the machine — the directory was left deleted with nothing copied back. ui-tui/ vanishing is what broke 'hermes --tui' (WinError 267), but the bug hit every top-level directory. _atomic_replace_dir stages the new copy into a sibling temp dir and only swaps it in on full success, restoring the original on failure. A failed update now leaves the live tree untouched instead of half-deleted.	2026-06-21 13:10:22 -07:00
teknium1	db097fb088	fix(cli): auto-restore a deleted ui-tui workspace from git before TUI launch The Windows update path can leave tracked ui-tui/ files deleted in the working tree (HEAD intact). The guard now self-heals: when ui-tui/ is missing in a git checkout, run `git restore -- ui-tui` and continue, falling back to the printed manual-recovery steps only when git can't recover it (no checkout / restore failed). Builds on konsisumer's missing-workspace guard.	2026-06-21 13:10:22 -07:00
konsisumer	537ad9ea9a	fix(cli): guard missing ui-tui workspace before TUI launch	2026-06-21 13:10:22 -07:00
峯岸　亮	5b45fb269a	fix(security): sanitize kanban markdown html	2026-06-21 13:10:17 -07:00
helix4u	7502d38bf9	fix(windows): prefer cmd npm shim on PATH fallback	2026-06-21 14:06:39 -06:00
Teknium	8e4d2fd23f	docs(plugins): document acting from hooks via ctx.profile_name + dispatch_tool (#50352 ) Answers a recurring plugin-author question: how to read the active profile and drive Hermes from inside a hook callback when ctx._cli_ref is None (gateway, hermes chat -q, and kanban-spawned worker sessions). - Adds a 'Act from inside a hook' section to the plugin guide covering ctx.profile_name and ctx.dispatch_tool as the session-agnostic APIs, with a kanban_task_blocked example, and notes there is no in-process slash-command bridge for headless workers (shell out via the terminal tool instead). - Adds the three kanban lifecycle hooks to the hook reference table with their process semantics. - Pins the contract with a regression test: ctx.dispatch_tool invokes a tool handler with _cli_ref=None (worker/hook context). Requested by @Smithangshu on Discord.	2026-06-21 12:54:40 -07:00
teknium1	b6f03ab891	docs(ui-tui): add billing.step_up.verification event + perfPane.tsx to README Follow-up on salvaged #50347: the event surface table was missing the billing.step_up.verification switch case, and the File map omitted lib/perfPane.tsx.	2026-06-21 12:52:22 -07:00
alelpoan	d7737bfd97	docs(ui-tui): fix file paths, add billing command, update file map	2026-06-21 12:52:22 -07:00
Teknium	d164ed0326	fix(kanban): make reclaim claim-lock-aware to stop task/run status desync (#50366 ) After a worker crash + reclaim + respawn, the board could show a task in the Ready lane while its task_run was 'running' and the new worker was actively executing (#36910). The dispatcher could then treat live work as available and double-assign. Root cause: the three reclaim paths (detect_crashed_workers, release_stale_claims heartbeat-stale backstop, enforce_max_runtime) each snapshot a task's worker_pid/claim_lock, do liveness work, then reset tasks.status back to 'ready' with only a 'WHERE status=running' guard. If the task was reclaimed AND re-claimed by a NEW worker in between (new run, new claim_lock, live pid), the stale UPDATE clobbered the live task: status flipped to 'ready' while the fresh run stayed 'running'. claim_task is the only writer that sets status='running', so nothing put it back — permanent desync. Fix: gate each reset on the snapshot's claim_lock (and worker_pid where available) so it only fires when the task is still owned by the worker the reclaim was computed for. A stale reclaim now no-ops (rowcount 0) instead of desyncing a re-claimed task. Genuine crashes (lock still matches) reclaim exactly as before. This is the same race class the in-gateway dispatch lock (single-writer ticks) mitigates, closed at the row level so a single dispatcher's fast reclaim->respawn across two ticks is also safe. Closes #36910.	2026-06-21 12:49:07 -07:00
memosr	87615f47b9	test(backup): add regression tests for restore_quick_snapshot path traversal Per @egilewski's audit on this PR, the security fix is behaviorally correct but lacks focused regression coverage for the two traversal vectors it closes. Adding tests now so the path-traversal guard cannot silently regress. * test_restore_rejects_snapshot_id_traversal -- exercises the snapshot_id input guard with seven hostile values (parent traversal, single parent, bare '.', bare '..', forward slash, backslash, empty string). Each must return False without touching the filesystem. * test_restore_rejects_manifest_rel_traversal -- exercises the manifest rel guard by injecting '../../outside.txt' into a real snapshot's manifest.json, seeding a source payload at the escaped path, and asserting the destination outside HERMES_HOME does not exist after restore. This is the higher-value test of the pair -- verified locally that it fails without the fix in restore_quick_snapshot (the escape destination gets written) and passes with the fix in place. The 67 pre-existing tests in test_backup.py continue to pass.	2026-06-21 12:44:22 -07:00
memosr	ae46699905	fix(security): validate snapshot_id and file paths in restore_quick_snapshot to prevent path traversal	2026-06-21 12:44:22 -07:00
Teknium	1f4c5aed6d	fix(kanban): honor kanban.auto_decompose toggle live, without a gateway restart (#50358 ) The gateway dispatcher captured kanban.auto_decompose ONCE at boot, so a user who flipped it to false to STOP auto-decompose had no way to make that take effect short of restarting the gateway. Reported (#49638): auto-decompose created and launched tasks the user never intended (while they were still typing the task description), and 'even Hermes Agent couldn't disable this feature' — because the live config edit was silently ignored. Auto-decompose is a safety toggle; turning it off must halt fan-out on the next tick. The dispatcher now re-reads the flag (and auto_decompose_per_tick) from config every tick via the extracted _resolve_auto_decompose_settings(), which fails SAFE (disabled) on a config read error so a transient failure can never re-enable a feature the user turned off. Closes #49638.	2026-06-21 12:43:44 -07:00
Teknium	84ba83b09a	fix(kanban): bound the cross-process init lock so connect() can't hang forever (#50353 ) connect() wrapped its entire body in an unbounded blocking flock(LOCK_EX) on every call (_cross_process_init_lock). A single process stalled inside the critical section — or a stale lock held by a wedged worker — blocked every other connect(), including the long-lived gateway dispatcher's next-tick connect, forever. No timeout, no traceback, no recovery: the board silently stopped being worked until a manual restart (issue #36644). Two fixes: 1. Fast-path skip: once THIS process has initialized a path, the expensive first-open work (header validation, integrity probe, schema + additive migrations) is already cached in _INITIALIZED_PATHS. The steady-state connect has nothing for the cross-process lock to protect, so it now opens the connection (WAL + pragmas) under only the cheap in-process _INIT_LOCK and never touches the file lock. This removes the lock from the dispatcher's hot path entirely — a stalled external 'hermes kanban list' can no longer block ticks. 2. Bounded acquire: even on first-init, _cross_process_init_lock now retries a non-blocking acquire up to a 10s deadline, then logs a WARNING and proceeds WITHOUT the cross-process lock. Safe because the in-process _INIT_LOCK still serializes same-process threads and the init work is idempotent (CREATE TABLE IF NOT EXISTS + additive migrations) — worst case is redundant work, not corruption. A bounded 'proceed anyway' beats an unbounded hang. Windows path switched LK_LOCK -> LK_NBLCK (non-blocking) to match. Closes #36644.	2026-06-21 12:43:41 -07:00
Teknium	9630ec6c19	fix(kanban): pin worker TERMINAL_CWD to the task workspace (#50348 ) _default_spawn launched the worker subprocess with cwd=workspace and set HERMES_KANBAN_WORKSPACE, but never set TERMINAL_CWD — so the worker inherited the dispatching gateway's TERMINAL_CWD. That value takes precedence over the process cwd in two places: - tools/file_tools.py::_resolve_base_dir — a relative write_file path resolved against the gateway user's home instead of the workspace, so artifacts silently landed outside the workspace (#41312). - agent_init's context-file loader — AGENTS.md was discovered relative to the gateway's cwd, so under multi-profile dispatch a worker loaded whichever gateway won the claim race's AGENTS.md, not the task's (#34619). Both are the same root cause. Pinning TERMINAL_CWD to the workspace (where the task's work actually happens) fixes both. Guarded on an existing absolute dir because file_tools rejects relative/sentinel TERMINAL_CWD values — a non-dir workspace leaves the inherited value rather than writing a meaningless one. Closes #34619, closes #41312.	2026-06-21 12:43:37 -07:00
Teknium	b6d1072408	fix(cli): branch new worktrees from the fresh remote tip, not stale local HEAD (#50355 ) hermes -w created the worktree branch from the standalone clone's HEAD, which lags origin when the clone isn't freshly updated (it's only refreshed by hermes update, not per session). Every worktree branch then rooted on a stale base, so the PR diff GitHub computes against current main ballooned with unrelated changes and the agent had to discover the staleness at push time and rebase. _resolve_worktree_base() now fetches and branches from the freshest available ref: the current branch's upstream if it tracks one (so a deliberate feature-branch worktree tracks its own remote), else the remote's default branch (origin/HEAD), else local HEAD as a fail-soft fallback (offline / no remote / detached). A bogus 'origin/(unknown)' default is guarded, and worktree creation retries from HEAD if branching off the remote ref fails — so this is never worse than the old behavior. Gated by worktree_sync (default true); set worktree_sync: false to keep the old branch-from-local-HEAD behavior. The resolved base is printed in the session banner. This is the follow-up to the #50319 session, where the standalone clone was 213 commits behind origin and the worktree inherited that stale base.	2026-06-21 12:42:11 -07:00
Teknium	e217fd42e2	feat(kanban): add task lifecycle plugin hooks (claimed/completed/blocked) (#50349 ) Plugins could observe session/tool/approval lifecycle but had no way to observe kanban task transitions. Adds three observer hooks fired by the board's claim/complete/block transitions: - kanban_task_claimed (dispatcher process, before worker spawn) - kanban_task_completed (worker process, carries summary) - kanban_task_blocked (worker process, carries reason) Each fires AFTER the DB write txn commits, so a plugin observes durable state and a slow/hanging callback can never hold the SQLite write lock. All firing is best-effort: a raising hook is logged and swallowed and never breaks a board transition. profile_name is resolved from HERMES_HOME so dispatcher- and worker-side hooks carry the right profile. Requested by @Smithangshu on Discord.	2026-06-21 12:38:14 -07:00
Teknium	9d883ac90e	feat(plugins): add ctx.profile_name for session-agnostic profile access (#50346) Plugins previously had no way to read the active profile name from the PluginContext. The workaround in the wild — reaching into ctx._manager._cli_ref — only works in an interactive CLI session; _cli_ref is None in the gateway and in kanban-spawned worker sessions (hermes -p <profile> chat -q ...), so the workaround breaks exactly where multi-profile awareness matters most. ctx.profile_name wraps hermes_cli.profiles.get_active_profile_name(), which derives the name from HERMES_HOME and therefore works in every execution context with zero dependency on _cli_ref.	2026-06-21 12:38:11 -07:00
Teknium	7d9f6a24f5	chore(release): add AUTHOR_MAP entry for #48678 salvage	2026-06-21 12:36:26 -07:00
natehale	565b7c8d9d	fix(telegram): stop typing indicator lingering after final reply After the agent's final response, the '...typing' bubble persisted ~5s. send() re-triggers send_typing() after every delivery so the bubble survives intermediate progress messages (Telegram clears typing on each delivered message). But that re-trigger also fired on the FINAL send, re-arming Telegram's ~5s timer AFTER the gateway had already torn down its typing-refresh loop — and Telegram exposes no stop-typing API, so nothing cancelled it. Gate the post-send re-trigger on the absence of metadata['notify'] (set only on the final user-visible reply via _mark_notify_metadata). Both the rich-message and legacy send paths are covered; intermediate progress sends still re-trigger so the bubble stays alive mid-response. Fixes #48678	2026-06-21 12:36:26 -07:00
Teknium	c0409a87ff	feat(gateway): typed send-error classification (SendResult.error_kind) (#50342) Add a platform-neutral send-failure vocabulary so consumers can branch on a typed category instead of substring-matching the raw provider message. - base.py: SEND_ERROR_KINDS + classify_send_error() (too_long / bad_format / forbidden / not_found / rate_limited / transient / unknown), and an optional SendResult.error_kind field (defaults None — fully backward compatible). - telegram.py: populate error_kind on send() failures; message_too_long keeps its existing error token plus error_kind='too_long'. Purely additive: no behavioral change to the existing degrade-and-deliver paths (MarkdownV2->plain-text fallback, overflow split, retry classification all untouched). 22 new tests + 210 adapter regression tests green.	2026-06-21 12:34:22 -07:00
teknium1	6bbacc2238	fix(desktop): make cold-start port-announcement deadline tolerant The port-announcement clock in waitForDashboardPort starts the instant the backend process is spawned — before uvicorn binds its socket. On a cold install the child first compiles and imports the whole hermes_cli.main -> web_server -> FastAPI/uvicorn chain, and on Windows real-time AV scans every freshly written .pyc. That pre-bind cost can exceed the old hardcoded 45s deadline, so the desktop killed a healthy-but-still-starting backend and respawned it, piling up orphaned processes (#50209). Raise the default to 90s and make it overridable via HERMES_DESKTOP_PORT_ANNOUNCE_TIMEOUT_MS, clamped to a 45s floor so a bad override can't reintroduce the loop. Warm starts still announce in well under a second; both call sites inherit the new default with no change. Adds backend-ready.test.cjs (wired into test:desktop:platforms).	2026-06-21 12:29:18 -07:00
joaomarcos	e580706d4d	test(web_server): add integration tests for desktop boot handshake fix Three tests covering the scenarios from issue #50209 that could not be validated with real Defender on a fresh install: 1. test_lifespan_warmup_is_nonblocking Patches _warm_gateway_module to sleep 3 s. Measures TestClient startup time — must complete in < 1.5 s, proving the fire-and-forget run_in_executor does not block the event loop before port binding (HERMES_DASHBOARD_READY timing proxy). 2. test_get_status_does_not_block_event_loop Patches _resolve_restart_drain_timeout to sleep 3 s. Fires concurrent GET /api/status and GET /api/version requests. /api/version must respond in < 3 s while /api/status waits — proving the event loop stays free during the slow import (15 s socket timeout would not fire). 3. test_concurrent_status_probes_all_respond Three simultaneous /api/status probes with the slow patch — all must return HTTP 200 (no connection resets, no orphan accumulation). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-21 12:29:18 -07:00
joaomarcos	475e81dab4	fix(web_server): use run_in_executor for gateway pre-warm and drain-timeout Fixes a regression introduced by the prior approach (synchronous import hermes_cli.gateway inside _lifespan) that caused a new failure mode: the blocking import stalled the asyncio event loop before uvicorn could bind its port, pushing HERMES_DASHBOARD_READY past the desktop shell's 45 s announcement deadline and triggering a respawn loop that accumulated orphaned backend processes. Two-part fix: _lifespan: replace the blocking import with a fire-and-forget run_in_executor call (_warm_gateway_module). The import runs in a worker thread while the server socket is already open, so HERMES_DASHBOARD_READY fires without delay. get_status: replace the inline lazy import with await run_in_executor(None, _resolve_restart_drain_timeout). This is the root fix for the original 15 s socket-timeout: the blocking .pyc-compilation + Defender scan is offloaded to a thread, keeping the event loop free for every /api/status probe. After the first call the module is in sys.modules and the executor returns in microseconds. Both helpers are extracted as module-level sync functions so they can be unit-tested independently of FastAPI or uvicorn. Closes #50209 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-21 12:29:18 -07:00
Teknium	5e3e89cc05	feat(hindsight): configurable embedded daemon health grace timeout (#50341) On resource-contended hosts the embedded Hindsight daemon can exceed a single 2s /health check; upstream then waits a grace window before treating it as stale and killing+restarting it (hindsight-embed reads HINDSIGHT_EMBED_PORT_HEALTH_GRACE_TIMEOUT, default 30s, into a module-level constant at import time). Users on busy boxes had no Hermes-side way to raise it short of hand-setting an env var. Add a 'port_health_grace_timeout' config.json option to the Hindsight plugin. When set, initialize() exports it to the process env BEFORE daemon_embed_manager is imported (the import-time read is the contract). setdefault() so an explicit operator env override always wins. Exposed in 'hermes memory setup' for local_embedded mode. Follow-up to #50308 / issue #13125 comment thread.	2026-06-21 12:20:53 -07:00
Eugeniusz Gilewski	def3f6388f	fix(file): anchor device symlink guard to task cwd The read_file device guard now walks symlink hops before the file operation layer, but that hop walk still interpreted relative paths against the Python process cwd. In sessions where TERMINAL_CWD points at the task workspace, a relative workspace symlink to a blocked alias such as /dev/../dev/stdin could therefore miss the intermediate device target before later task-cwd resolution. Anchor relative device checks to the task base before symlink-hop inspection so the pre-I/O guard sees the same workspace path that read_file would otherwise read. Absolute device paths and the existing final realpath fallback remain unchanged. Refs #10141 Refs #29158	2026-06-21 12:16:10 -07:00
teknium1	e267237671	test(photon): cover overflow retry, typing cooldown, sidecar-crash detection Follow-up for salvaged PR #50256. Unit tests for the three behaviors: retryable classification of Envoy/sidecar overflow strings, per-chat typing cooldown with stop_typing reset, and the _supervise_sidecar crash-detection path that raises a retryable fatal (and the clean-shutdown no-op).	2026-06-21 12:15:44 -07:00
joaomarcos	9578e52795	fix(photon): detect unexpected sidecar death and trigger reconnect When the Node spectrum-ts sidecar process exited mid-session (crash, OOM, upstream overflow escalation), _supervise_sidecar returned silently — readline hit EOF, the log-pump loop broke, and nothing notified the gateway. _inbound_loop entered an infinite retry loop against a dead port, _running stayed True, and the adapter remained in self.adapters with no path to self-recovery short of a manual gateway restart. Add a death-detection tail to _supervise_sidecar: after the log-pump exits (EOF or exception), guard on _inbound_running to distinguish unexpected death from a deliberate disconnect(). On unexpected exit, call _set_fatal_error("SIDECAR_CRASHED", retryable=True) followed by _notify_fatal_error() so the reconnect watcher picks up the platform within 30 s and retries with exponential backoff (30 s → 300 s cap) until the sidecar comes back up. All other platforms remain unaffected. The _inbound_running guard is safe against races: disconnect() sets _inbound_running = False before _stop_sidecar() cancels the supervisor task. CancelledError is BaseException, not Exception, so it bypasses the except clause and propagates normally — the detection block never runs during a clean shutdown.	2026-06-21 12:15:44 -07:00
joaomarcos	2a4542333e	fix(photon): classify Envoy overflow errors as retryable; add typing cooldown Closes #50185 Two independent gaps let a transient Photon/Spectrum upstream overflow degrade message delivery and amplify gRPC pressure: 1. _is_retryable_error did not recognise Photon- or Envoy-specific error strings ("internal sidecar error", "upstream connect error", "reset reason: overflow"), so _send_with_retry fell through to the plain-text fallback immediately instead of backing off and retrying. 2. send_typing had no rate gate, so a burst of typing-indicator calls during an overflow event kept hitting the upstream gRPC connection and widened the failure window. Fix: - Add _PHOTON_RETRYABLE_PATTERNS with the three high-specificity Envoy / sidecar substrings and override _is_retryable_error on PhotonAdapter to check them after delegating to the base-class patterns. base.py and all other adapters are untouched. - Add a 5 s per-chat cooldown in send_typing backed by _typing_last_sent. stop_typing clears the entry so the next start after a completed turn fires immediately — only rapid consecutive starts without a stop are suppressed. - Reduce PhotonAdapter._send_with_retry default max_retries from 2 to 1 (single 2 s back-off check) — enough to confirm whether the Envoy circuit-breaker has opened, without adding unnecessary latency. All changes are scoped to plugins/platforms/photon/adapter.py.	2026-06-21 12:15:44 -07:00
Teknium	7a131f7f40	fix(api-server): stop silently promising async delivery on stateless HTTP path (#50319) * fix(api-server): stop silently promising async delivery on stateless HTTP path terminal(notify_on_complete=True / watch_patterns) and delegate_task(background=True) silently no-op'd on the API server / WebUI path (#10760): the watcher / detached child registered, but every API-server route (OpenAI-spec /v1/chat/completions and /v1/responses, plus the proprietary /v1/runs SSE stream) tears down its channel when the turn ends, and APIServerAdapter.send() is a no-op stub. A completion that fires after the response closed had nowhere to go; from the agent side, indistinguishable from a hang. There is no spec-compliant surface to wake the agent later on a stateless HTTP client, so make the no-op honest instead of silent: - Add a per-adapter capability flag supports_async_delivery (default True; APIServerAdapter = False), propagated into a HERMES_SESSION_ASYNC_DELIVERY contextvar via async_delivery_supported(). Toggle on the adapter, not a hardcoded platform string, so a future stateless adapter is correct-by-default. - terminal: when delivery is unsupported, skip watcher registration, force notify_on_complete off, and return a notify_unsupported note telling the agent to process(action='poll'). - delegate_task: when delivery is unsupported, fall back to SYNCHRONOUS execution (work runs and returns in the same response) with a note, instead of handing out a handle that never resolves. CLI (in-process completion_queue) and the real gateway platforms are unchanged. Fixes #10760 * refactor(api-server): route session binding through a single no-delivery chokepoint Add APIServerAdapter._bind_api_server_session() and route both agent-entry paths (_run_agent for /v1/chat/completions + /v1/responses, and the /v1/runs _run_sync path) through it. The helper hardwires platform="api_server" and async_delivery=False with no async_delivery parameter to pass, so a future route added to the API server physically cannot reintroduce the silent no-op (#10760) by forgetting to mark the channel as non-delivering. The binding stays request-scoped (cleared per turn), so a session resumed later on a delivering interface (CLI / gateway platform) re-binds fresh and is NOT blocked: the no-delivery decision tracks the interface handling the current turn, never the session.	2026-06-21 12:15:14 -07:00
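
Illustrative note on commit 7a131f7f40: the capability-flag pattern it describes could look roughly like the sketch below. This is not the repository's code. The names supports_async_delivery, async_delivery_supported, HERMES_SESSION_ASYNC_DELIVERY, notify_on_complete and notify_unsupported come from the commit message; every class, function signature, default value and return shape here is an assumption made for illustration (StatelessHTTPAdapter, bind_session and start_terminal_job are hypothetical).

```python
from contextvars import ContextVar

# Variable name from the commit message; the bool type and True default are assumptions.
HERMES_SESSION_ASYNC_DELIVERY: ContextVar[bool] = ContextVar(
    "HERMES_SESSION_ASYNC_DELIVERY", default=True
)


class BaseAdapter:
    # Capability flag on the adapter: delivery after the turn ends is assumed possible.
    supports_async_delivery = True


class StatelessHTTPAdapter(BaseAdapter):
    # Hypothetical stand-in for an adapter whose channel closes when the turn ends.
    supports_async_delivery = False


def bind_session(adapter: BaseAdapter):
    """Copy the adapter's capability into the context for the current turn.

    Returns the token needed to undo the binding when the turn is over.
    """
    return HERMES_SESSION_ASYNC_DELIVERY.set(adapter.supports_async_delivery)


def async_delivery_supported() -> bool:
    """Report whether a late completion can still reach the client."""
    return HERMES_SESSION_ASYNC_DELIVERY.get()


def start_terminal_job(command: str, notify_on_complete: bool = False) -> dict:
    """Hypothetical tool entry point that degrades honestly instead of silently."""
    result = {"command": command, "notify_on_complete": notify_on_complete}
    if notify_on_complete and not async_delivery_supported():
        # Nothing could deliver the notification later, so say so and ask for polling.
        result["notify_on_complete"] = False
        result["notify_unsupported"] = "async delivery unavailable; poll for completion"
    return result
```

Because the binding is set per turn and reset afterwards, the same session handled later by a delivering adapter would see the default again, which matches the request-scoped behaviour the commit message describes.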


12453 commits