hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-06-24 10:52:21 +00:00

Author	SHA1	Message	Date
Teknium	d164ed0326	fix(kanban): make reclaim claim-lock-aware to stop task/run status desync (#50366) After a worker crash + reclaim + respawn, the board could show a task in the Ready lane while its task_run was 'running' and the new worker was actively executing (#36910). The dispatcher could then treat live work as available and double-assign. Root cause: the three reclaim paths (detect_crashed_workers, release_stale_claims heartbeat-stale backstop, enforce_max_runtime) each snapshot a task's worker_pid/claim_lock, do liveness work, then reset tasks.status back to 'ready' with only a 'WHERE status=running' guard. If the task was reclaimed AND re-claimed by a NEW worker in between (new run, new claim_lock, live pid), the stale UPDATE clobbered the live task: status flipped to 'ready' while the fresh run stayed 'running'. claim_task is the only writer that sets status='running', so nothing put it back — permanent desync. Fix: gate each reset on the snapshot's claim_lock (and worker_pid where available) so it only fires when the task is still owned by the worker the reclaim was computed for. A stale reclaim now no-ops (rowcount 0) instead of desyncing a re-claimed task. Genuine crashes (lock still matches) reclaim exactly as before. This is the same race class the in-gateway dispatch lock (single-writer ticks) mitigates, closed at the row level so a single dispatcher's fast reclaim->respawn across two ticks is also safe. Closes #36910.	2026-06-21 12:49:07 -07:00
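The row-level guard described in commit d164ed0326 can be sketched as a compare-and-swap UPDATE. This is a minimal illustration, not the project's code: the table layout and the `reclaim_if_still_owned` helper are hypothetical, and only the column names `status`, `claim_lock`, and `worker_pid` come from the commit message.

```python
import sqlite3


def reclaim_if_still_owned(conn, task_id, snap_lock, snap_pid):
    """Reset a task to 'ready' only if the snapshotted owner still holds it.

    Hypothetical helper: returns True when the reclaim applied, False when
    the task was re-claimed in the meantime (rowcount 0, a no-op).
    """
    cur = conn.execute(
        "UPDATE tasks SET status = 'ready', claim_lock = NULL, worker_pid = NULL "
        "WHERE id = ? AND status = 'running' "
        "AND claim_lock = ? AND worker_pid = ?",
        (task_id, snap_lock, snap_pid),
    )
    conn.commit()
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks (id INTEGER PRIMARY KEY, status TEXT, "
    "claim_lock TEXT, worker_pid INTEGER)"
)
conn.execute("INSERT INTO tasks VALUES (1, 'running', 'lock-A', 100)")

# The reclaim path snapshots the owner, then does slow liveness work.
stale_snapshot = ("lock-A", 100)

# Meanwhile the task is reclaimed and re-claimed by a new worker.
conn.execute("UPDATE tasks SET claim_lock = 'lock-B', worker_pid = 200 WHERE id = 1")

# The stale reclaim no longer matches the row, so it changes nothing.
stale_applied = reclaim_if_still_owned(conn, 1, *stale_snapshot)
status_after_stale = conn.execute("SELECT status FROM tasks WHERE id = 1").fetchone()[0]

# A genuine crash: the lock still matches, so the reclaim goes through.
fresh_applied = reclaim_if_still_owned(conn, 1, "lock-B", 200)
status_after_fresh = conn.execute("SELECT status FROM tasks WHERE id = 1").fetchone()[0]
```

With only a `WHERE status = 'running'` guard, the first call would have flipped the live task back to 'ready'. Matching on the snapshotted lock turns that case into a no-op.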
memosr	ae46699905	fix(security): validate snapshot_id and file paths in restore_quick_snapshot to prevent path traversal	2026-06-21 12:44:22 -07:00
Teknium	84ba83b09a	fix(kanban): bound the cross-process init lock so connect() can't hang forever (#50353) connect() wrapped its entire body in an unbounded blocking flock(LOCK_EX) on every call (_cross_process_init_lock). A single process stalled inside the critical section — or a stale lock held by a wedged worker — blocked every other connect(), including the long-lived gateway dispatcher's next-tick connect, forever. No timeout, no traceback, no recovery: the board silently stopped being worked until a manual restart (issue #36644). Two fixes: 1. Fast-path skip: once THIS process has initialized a path, the expensive first-open work (header validation, integrity probe, schema + additive migrations) is already cached in _INITIALIZED_PATHS. The steady-state connect has nothing for the cross-process lock to protect, so it now opens the connection (WAL + pragmas) under only the cheap in-process _INIT_LOCK and never touches the file lock. This removes the lock from the dispatcher's hot path entirely — a stalled external 'hermes kanban list' can no longer block ticks. 2. Bounded acquire: even on first-init, _cross_process_init_lock now retries a non-blocking acquire up to a 10s deadline, then logs a WARNING and proceeds WITHOUT the cross-process lock. Safe because the in-process _INIT_LOCK still serializes same-process threads and the init work is idempotent (CREATE TABLE IF NOT EXISTS + additive migrations) — worst case is redundant work, not corruption. A bounded 'proceed anyway' beats an unbounded hang. Windows path switched LK_LOCK -> LK_NBLCK (non-blocking) to match. Closes #36644.	2026-06-21 12:43:41 -07:00
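The "bounded acquire" idea in commit 84ba83b09a can be sketched independently of the file-lock API. The helper name `acquire_with_deadline`, its parameters, and the use of `threading.Lock` as a stand-in for the cross-process flock are all assumptions made for illustration; the real code retries a non-blocking flock.

```python
import threading
import time


def acquire_with_deadline(try_acquire, deadline_s=10.0, poll_s=0.05):
    """Retry a non-blocking acquire until it succeeds or the deadline passes.

    Returns True if the lock was taken, False if the caller should proceed
    without it. `try_acquire` is any callable returning True on success.
    """
    give_up_at = time.monotonic() + deadline_s
    while True:
        if try_acquire():
            return True
        if time.monotonic() >= give_up_at:
            return False
        time.sleep(poll_s)


lock = threading.Lock()  # stand-in for the cross-process file lock

# Uncontended: acquired on the first attempt.
got_free = acquire_with_deadline(lambda: lock.acquire(blocking=False), deadline_s=0.2)

# Contended (still held from above): gives up at the deadline instead of hanging.
got_held = acquire_with_deadline(lambda: lock.acquire(blocking=False), deadline_s=0.2)
lock.release()
```

Proceeding without the lock is only reasonable because, as the commit message notes, the protected initialization work is idempotent.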
Teknium	9630ec6c19	fix(kanban): pin worker TERMINAL_CWD to the task workspace (#50348) _default_spawn launched the worker subprocess with cwd=workspace and set HERMES_KANBAN_WORKSPACE, but never set TERMINAL_CWD — so the worker inherited the dispatching gateway's TERMINAL_CWD. That value takes precedence over the process cwd in two places: - tools/file_tools.py::_resolve_base_dir — a relative write_file path resolved against the gateway user's home instead of the workspace, so artifacts silently landed outside the workspace (#41312). - agent_init's context-file loader — AGENTS.md was discovered relative to the gateway's cwd, so under multi-profile dispatch a worker loaded whichever gateway won the claim race's AGENTS.md, not the task's (#34619). Both are the same root cause. Pinning TERMINAL_CWD to the workspace (where the task's work actually happens) fixes both. Guarded on an existing absolute dir because file_tools rejects relative/sentinel TERMINAL_CWD values — a non-dir workspace leaves the inherited value rather than writing a meaningless one. Closes #34619, closes #41312.	2026-06-21 12:43:37 -07:00
Teknium	e217fd42e2	feat(kanban): add task lifecycle plugin hooks (claimed/completed/blocked) (#50349) Plugins could observe session/tool/approval lifecycle but had no way to observe kanban task transitions. Adds three observer hooks fired by the board's claim/complete/block transitions: - kanban_task_claimed (dispatcher process, before worker spawn) - kanban_task_completed (worker process, carries summary) - kanban_task_blocked (worker process, carries reason) Each fires AFTER the DB write txn commits, so a plugin observes durable state and a slow/hanging callback can never hold the SQLite write lock. All firing is best-effort: a raising hook is logged and swallowed and never breaks a board transition. profile_name is resolved from HERMES_HOME so dispatcher- and worker-side hooks carry the right profile. Requested by @Smithangshu on Discord.	2026-06-21 12:38:14 -07:00
Teknium	9d883ac90e	feat(plugins): add ctx.profile_name for session-agnostic profile access (#50346) Plugins previously had no way to read the active profile name from the PluginContext. The workaround in the wild — reaching into ctx._manager._cli_ref — only works in an interactive CLI session; _cli_ref is None in the gateway and in kanban-spawned worker sessions (hermes -p <profile> chat -q ...), so the workaround breaks exactly where multi-profile awareness matters most. ctx.profile_name wraps hermes_cli.profiles.get_active_profile_name(), which derives the name from HERMES_HOME and therefore works in every execution context with zero dependency on _cli_ref.	2026-06-21 12:38:11 -07:00
joaomarcos	475e81dab4	fix(web_server): use run_in_executor for gateway pre-warm and drain-timeout Fixes a regression introduced by the prior approach (synchronous import hermes_cli.gateway inside _lifespan) that caused a new failure mode: the blocking import stalled the asyncio event loop before uvicorn could bind its port, pushing HERMES_DASHBOARD_READY past the desktop shell's 45 s announcement deadline and triggering a respawn loop that accumulated orphaned backend processes. Two-part fix: _lifespan: replace the blocking import with a fire-and-forget run_in_executor call (_warm_gateway_module). The import runs in a worker thread while the server socket is already open, so HERMES_DASHBOARD_READY fires without delay. get_status: replace the inline lazy import with await run_in_executor(None, _resolve_restart_drain_timeout). This is the root fix for the original 15 s socket-timeout: the blocking .pyc-compilation + Defender scan is offloaded to a thread, keeping the event loop free for every /api/status probe. After the first call the module is in sys.modules and the executor returns in microseconds. Both helpers are extracted as module-level sync functions so they can be unit-tested independently of FastAPI or uvicorn. Closes #50209 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-21 12:29:18 -07:00
Teknium	e581740aa1	fix(kanban): single-writer dispatch lock to prevent orphan-dispatcher DB corruption (#50331) A shell-launched 'hermes gateway run --replace' / 'gateway restart' on a systemd/launchd host can leave an orphan gateway whose kanban dispatcher escapes the service cgroup, survives 'systemctl restart', and becomes a second long-lived writer on the shared kanban.db. Two dispatchers that each believe they own the file both pass SQLite busy_timeout and then race on WAL frames — the documented root cause of multi-writer corruption (issue #35240). The existing _guard_supervised_gateway_conflict startup guard blocks the common way an orphan is born, but does nothing once a second dispatcher already exists. This adds the defense-in-depth: dispatch_once now wraps every tick in a non-blocking, board-scoped flock (_dispatch_tick_lock). A losing dispatcher returns DispatchResult(skipped_locked=True) and does zero DB writes this tick — so two dispatchers can never run a reclaim/spawn/write sequence concurrently regardless of how the second one got there. - Non-blocking (LOCK_NB): never stalls the gateway's async watcher. - Board-scoped: lock file is a .dispatch.lock sibling of each board's kanban.db, so unrelated boards tick in parallel. - POSIX + Windows (fcntl / msvcrt LK_NBLCK), no-op degrade where neither exists — mirrors the existing _cross_process_init_lock pattern. Verified with a real two-process orphan repro: while a separate process holds the lock, dispatch_once skips; after release it runs.	2026-06-21 12:06:24 -07:00
Teknium	587b5b9ac2	fix(backup): capture memory-provider state stored outside HERMES_HOME (#50325) hermes backup only walks HERMES_HOME, so memory providers that keep config/credentials in home-anchored dotdirs (honcho -> ~/.honcho, hindsight -> ~/.hindsight, openviking -> ~/.openviking) lost that data across a backup/import cycle — the peer IDs, session pairings, and API keys never made it into the archive. Add an optional MemoryProvider.backup_paths() hook (default []). The active provider declares its external paths; backup resolves them from config only (no init, no network), archives the ones under the home dir into a reserved _external/ subtree encoded relative to home, and import restores them to their original location with a home-anchored traversal guard and 0600 on credential-shaped files. Paths outside home are skipped as non-portable. honcho, hindsight, and openviking override the hook. E2E-validated full backup->import cycle plus 7 new tests.	2026-06-21 12:03:46 -07:00
kn8-codes	6183e8ce1b	fix(telegram): make Bot API 10.1 rich messages opt-in (default off) Rich messages are not ready for primetime: current Telegram clients can render Bot API 10.1 rich messages as blank/unsupported bubbles and make them hard to copy as plain text, which is worse than the legacy MarkdownV2 path for command snippets and mobile handoffs. Default the rich_messages toggle to False so replies stay on the copyable legacy path; users opt in per bot via platforms.telegram.extra.rich_messages: true. Updates adapter, gateway config default, example config, English + zh-Hans docs, and the default/opt-in tests.	2026-06-21 12:03:24 -07:00
sgaofen	93ea9b04af	fix(gateway): cap inbound media download size to prevent memory exhaustion Inbound image/audio/video payloads were buffered fully into process memory before being written to the cache, with no size limit. A large upload (Discord Nitro allows 500 MB) or a remote media URL in an inbound message pointing at a huge file could spike RAM and OOM-kill the gateway. Enforce a configurable cap in the shared cache helpers (gateway/platforms/ base.py) so the protection holds across every platform adapter, not one: - cache_image/audio/video_from_bytes reject oversized payloads before writing (video was the gap in the original report — now covered). - cache_image/audio_from_url stream the body, rejecting on an oversized Content-Length header and re-checking the running total per chunk so an absent/lying header can't smuggle an unbounded body past the cap. - Discord's _read_attachment_bytes checks att.size up front, so an oversized attachment is rejected before any bytes are pulled into memory. Configurable via gateway.max_inbound_media_bytes in config.yaml (default 128 MiB; 0 disables). No new env var — non-secret config lives in config.yaml. Salvaged and extended from @sgaofen's PR #13341 (the original report and the shared-helper approach). Reapplied onto current main (Discord adapter has since moved to plugins/platforms/discord/), the configurable knob moved from an env var to config.yaml, and the video cache helper added. Co-authored-by: Hermes Agent <noreply@nousresearch.com>	2026-06-21 11:56:46 -07:00
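The streaming check described in commit 93ea9b04af (reject on an oversized Content-Length, then re-check the running total per chunk) can be sketched as below. `read_capped` and `MediaTooLarge` are hypothetical names for illustration; the project's helpers are the `cache_*_from_url` functions, whose signatures are not shown in the log.

```python
class MediaTooLarge(Exception):
    """Raised when an inbound payload exceeds the configured cap."""


def read_capped(chunks, max_bytes, content_length=None):
    """Collect chunks into bytes, enforcing max_bytes (0 disables the cap)."""
    # Cheap early rejection when the sender declares an oversized body.
    if max_bytes and content_length is not None and content_length > max_bytes:
        raise MediaTooLarge(f"declared size {content_length} > cap {max_bytes}")
    total = 0
    parts = []
    for chunk in chunks:
        total += len(chunk)
        # Re-check per chunk: the header may be absent or understate the size.
        if max_bytes and total > max_bytes:
            raise MediaTooLarge(f"body exceeded cap of {max_bytes} bytes")
        parts.append(chunk)
    return b"".join(parts)


def rejected(chunks, max_bytes, content_length=None):
    """Return True if read_capped refuses the payload."""
    try:
        read_capped(chunks, max_bytes, content_length)
    except MediaTooLarge:
        return True
    return False


small = read_capped([b"ab", b"cd"], max_bytes=8, content_length=4)
lying_header = rejected([b"abcd", b"efgh", b"i"], max_bytes=8, content_length=4)
honest_oversize = rejected(iter(()), max_bytes=8, content_length=9)
uncapped = read_capped([b"x" * 16], max_bytes=0)
```

The `lying_header` case is the one the per-chunk check exists for: the declared length passes the header test, but the body is cut off once it crosses the cap.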
Teknium	a18bae65b9	fix(config): redact api_key in config show/set output (#50245) (#50313) hermes config show printed the model dict raw via print(), bypassing the logging redactor; a custom-provider api_key (e.g. Cloudflare cfut_...) was shown in plaintext even with security.redact_secrets=true. Opaque tokens don't match any vendor-prefix regex, so structural key-name masking is required. - Add redact_config_value(): recursively masks credential-shaped keys (api_key/token/secret/... exact-match) via mask_secret. - Wrap the show_config model dump in it. - Mask the set_config_value echo when the leaf key is credential-shaped (config set model.api_key routes to config.yaml, lowercase misses the .env allowlist).	2026-06-21 11:50:31 -07:00
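Structural key-name masking, as commit a18bae65b9 describes it, might look roughly like the sketch below. The function names mirror those in the commit message, but the bodies, the key set, and the masking format are assumptions for illustration and not the project's implementation.

```python
SENSITIVE_KEYS = {"api_key", "token", "secret"}  # assumed subset of the real list


def mask_secret(value):
    """Illustrative mask: keep the last 4 characters of longer strings."""
    text = str(value)
    return "****" + text[-4:] if len(text) > 8 else "****"


def redact_config_value(value):
    """Recursively mask string values stored under credential-shaped keys."""
    if isinstance(value, dict):
        return {
            key: mask_secret(item)
            if str(key).lower() in SENSITIVE_KEYS and isinstance(item, str) and item
            else redact_config_value(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact_config_value(item) for item in value]
    return value


model = {
    "provider": "custom",
    "api_key": "cfut_0123456789abcdef",
    "extra": [{"token": "short"}, {"base_url": "https://example.invalid/v1"}],
}
redacted = redact_config_value(model)
```

Matching on the key name rather than the value is what covers opaque tokens with no recognizable vendor prefix. The input dict is left unmodified, so the caller can still use the real credential.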
Teknium	03563dabac	fix(gateway): raise session-hygiene hard message limit 400 → 5000 (#50194) The gateway pre-compression hygiene valve force-compressed any session crossing 400 messages regardless of token usage. On large-context (1M+) models doing many short, message-dense turns, a healthy session at ~16% token usage could hit 400 messages and get force-compressed — and the compression summary's stale Active Task could then bleed into the next turn. The valve's actual purpose is to break a death spiral: when API calls keep disconnecting on an oversized session, no token-usage data arrives, the token threshold never fires, and the transcript grows unbounded. It's a count-based floor for that pathological case only. 400 was tuned for ~200K-context models and is far too low for modern large-context sessions. Raise the default to 5000 — still well clear of any death spiral, but no longer firing on legitimate long conversations. The value remains fully configurable via compression.hygiene_hard_message_limit.	2026-06-21 08:26:19 -07:00
Teknium	e499d69e3e	feat(api-server): configurable concurrent-run cap to prevent DoS (#50007) The OpenAI-compatible API server only enforced a hardcoded cap of 10 concurrent runs on /v1/runs, leaving /v1/chat/completions and /v1/responses unbounded — a request flood could exhaust CPU, memory, and upstream LLM quota (#7483). - Add gateway.api_server.max_concurrent_runs (config.yaml, default 10, 0 disables). No env var. - Shared concurrency gate across all three agent-serving endpoints, counting both the chat/responses in-flight counter and the /v1/runs stream set. Returns OpenAI-style 429 + Retry-After when at the cap. - Remove the dead hardcoded _MAX_CONCURRENT_RUNS class attribute. Closes #7483.	2026-06-21 07:26:03 -07:00
kshitijk4poor	4d7bb382b0	refactor(gateway): route all active_agents coercion through parse_active_agents; harden drain-timeout fallback Second cleanup pass (simplify-code review of the first follow-up): - write_runtime_status now clamps active_agents via parse_active_agents instead of an inline max(0, int(...)). Removes the duplicated clamp the helper's docstring acknowledged AND closes a write-side ValueError gap (a non-numeric active_agents previously raised; now degrades to 0). - hermes_cli/gateway.py draining-status line routes its active-agents count through parse_active_agents too — the third coercion site of the same persisted field, now consistent and non-raising with the two HTTP surfaces. - web_server.py /api/status: the drain-timeout resolver fallback now catches ImportError specifically and falls back to DEFAULT_GATEWAY_RESTART_DRAIN_TIMEOUT (a real float) instead of a blanket 'except Exception -> None'. None would have violated the surfaced field's int/float contract and stripped NAS's poll-deadline hint silently. - Dropped a redundant 'if runtime else 0' branch (parse_active_agents already handles the empty/None case) and tightened the parse_active_agents docstring to describe the actual single-contract role (write + both reads).	2026-06-21 17:22:52 +05:30
kshitijk4poor	b577f25100	refactor(gateway): dedupe drain-timeout resolution + share active_agents parse Follow-up cleanups on top of the busy/idle readout (PR #50103): - web_server.py /api/status reused the single drain-timeout resolver hermes_cli.gateway._get_restart_drain_timeout() (HERMES_RESTART_DRAIN_TIMEOUT env -> agent.restart_drain_timeout config -> default) instead of inlining a third hand-rolled copy of that precedence chain. Also fixes a subtle divergence: the inline copy used os.environ.get() so a set-but-empty env var was treated as a value rather than falling through to config; the shared resolver .strip()s and falls through correctly. - Added gateway.status.parse_active_agents() and routed BOTH HTTP surfaces (/api/status and /health/detailed) through it, so the exposed active_agents field is consistently clamped non-negative. Previously /api/status clamped while /health/detailed exposed the raw file value, diverging on a corrupt count. - Added TestParseActiveAgents covering the shared coercion contract.	2026-06-21 17:22:52 +05:30
Ben	0ee75469d7	feat(dashboard): surface gateway busy/drainable on /api/status Give an external consumer (NAS) a trustworthy, always-reachable busy/idle readout it can poll before a disruptive lifecycle action (restart, migrate, stop, auto-update). The dashboard /api/status is the only HTTP surface guaranteed up on a hosted agent regardless of which gateway platforms are enabled, and it already reads gateway_state.json. Add to /api/status (additive, non-breaking): - active_agents — in-flight gateway-turn count (now refreshed per-turn by the companion gateway-side commit) - gateway_busy — running AND active_agents > 0 - gateway_drainable — running and live (a valid begin-drain target) - restart_drain_timeout — resolved seconds, so the consumer can size its poll deadline without out-of-band knowledge (env HERMES_RESTART_DRAIN_TIMEOUT → config agent.restart_drain_timeout → default) The busy/drainable contract is defined once in gateway.status (derive_gateway_busy / derive_gateway_drainable) and consumed by both /api/status and /health/detailed so the two surfaces can never disagree. Liveness keys off gateway_running (a live PID/health probe), NEVER gateway_updated_at — a healthy idle gateway never advances that timestamp. All derived fields degrade to safe falsy values when the gateway is down or the status file is absent/corrupt (never a spurious "busy" that would wedge the consumer). active_sessions (the 5-min DB recency heuristic the SPA reads) is left exactly as-is — new signal, new fields. Tests (behaviour contracts, not snapshots): the pure derivation contract across every running/state/count/liveness combination; /api/status integration for busy, idle-drainable, draining, down, stale-busy-file, corrupt-count, and timeout surfacing; and /health/detailed parity.	2026-06-21 17:22:52 +05:30
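The busy contract stated in commit 0ee75469d7 ("running AND active_agents > 0") and the non-raising coercion described in the follow-up commits can be sketched together. The function names follow the commit messages, but the signatures and bodies here are assumptions for illustration; the real helpers live in gateway.status.

```python
def parse_active_agents(raw):
    """Coerce a persisted count to a non-negative int; bad input becomes 0."""
    try:
        return max(0, int(raw))
    except (TypeError, ValueError, OverflowError):
        return 0


def derive_gateway_busy(running, raw_active_agents):
    """Busy means the gateway is running AND has in-flight turns."""
    return bool(running) and parse_active_agents(raw_active_agents) > 0


clamped = [parse_active_agents(v) for v in (3, "2", -5, None, "corrupt")]
busy = derive_gateway_busy(True, 2)
idle = derive_gateway_busy(True, 0)
down_with_stale_count = derive_gateway_busy(False, 4)
corrupt_count = derive_gateway_busy(True, "corrupt")
```

Every degraded input resolves to "not busy", which matches the stated goal of never reporting a spurious busy state that would wedge the polling consumer.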
kshitijk4poor	1ca29723f0	fix(cli): log instead of swallow preflight-warning errors; consistent TUI warning field Follow-up to the salvaged preflight-compression warning: - Replace silent `except Exception: pass` at all 5 guard call sites (cli.py x2, gateway/slash_commands.py x2, tui_gateway/server.py) with `logger.debug(...)` so signature drift in the guard helper isn't hidden. - tui_gateway/server.py: set the confirm dict's `warning` field to the merged message (was bare expensive-model text) so it matches `confirm_message` for any future consumer reading `warning`. - Add trailing newlines to the two new files.	2026-06-21 16:31:56 +05:30
Tuna Dev	04730f32e7	fix(cli): warn when in-session model switch will preflight-compress Adds hermes_cli/context_switch_guard.py mirroring the model_cost_guard pattern. When a user switches models mid-session (Herm TUI picker, CLI, or /model on Telegram/Discord), the warning surfaces on the existing ModelSwitchResult.warning_message path used by the expensive-model guard if the new model's compression threshold is below the current session size. Partial fix for #23767 — addresses only the 'user-facing guardrail when switching from a high-context provider to a substantially lower-context provider' slice. The other proposed fixes from that issue (hard preflight token guard, metadata cache invalidation on switch, compression safety invariant, oversized tool-output handling) are out of scope for this PR.	2026-06-21 16:29:31 +05:30
kshitij	f6a504d088	Merge pull request #50025 from NousResearch/salvage/cron-run-immediate fix(cron): execute job immediately on action=run	2026-06-21 13:53:13 +05:30
kyssta-exe	65d7c7fafd	fix(cron): execute job immediately on action='run' `cronjob(action='run')` (and `hermes cron run`) only set `next_run_at = now` and returned success, relying on the scheduler ticker to actually execute the job on its next tick. When no gateway/ticker is running — a CLI-only setup, or the Windows case in #41037 — the job never executed: `run` reported success, but `last_run_at` stayed null forever, no output, no delivery. A manual `run` should actually run. `_execute_job_now` now: - claims the job via `claim_job_for_fire` — the same at-most-once CAS the scheduler/external-provider fire path uses. This both advances `next_run_at` for recurring jobs and blocks a concurrently-running gateway ticker from double-firing the same job; if the claim is lost, the run is skipped (the tool reports `execution_skipped`). This closes the double-fire race that a bare `advance_next_run` left open (a tick whose `get_due_jobs` already captured the job between trigger and advance would still fire it). - delegates firing to `run_one_job` — the single shared execute→save→deliver→mark body the ticker and external providers use — so failure delivery, `[SILENT]` handling, and live-adapter delivery stay identical across paths and can't drift. (The original salvage re-implemented this sequence inline and had already dropped failure delivery + `[SILENT]`.) The tool response carries `executed`, `execution_success`, and either `execution_error` or `execution_skipped`. The `hermes cron run` CLI message no longer claims "It will run on the next scheduler tick" — it reports the actual "Ran now: succeeded/failed" outcome (or the skip). Salvaged from #41130 by @kyssta-exe (authorship preserved); reworked to reuse `claim_job_for_fire` + `run_one_job` per review rather than re-implementing the fire sequence inline. Adds tests for the claim-then-fire path, claim-lost skip, failure reporting, and exception capture. Fixes #41037 Co-authored-by: kyssta-exe <kyssta-exe@users.noreply.github.com>	2026-06-21 13:28:04 +05:30
annguyenNous	07424da76f	fix(cron): keep ticker alive on BaseException + heartbeat-aware status The in-process cron ticker (cron/scheduler_provider.py) caught only `Exception` and logged at DEBUG, so a `SystemExit`/`KeyboardInterrupt` raised from a misbehaving provider SDK or agent retry path killed the ticker thread silently. The gateway PROCESS stayed up, so `hermes cron status` — which only checks `find_gateway_pids()` — kept reporting "✓ jobs will fire automatically" while no jobs ever fired (#32612, #32895). This makes ticker death survivable and detectable: - The ticker loop now catches `BaseException` and logs at ERROR with a traceback, so a single bad tick no longer tears the thread down and the failure is visible in the gateway log. - The loop records a heartbeat (`cron/ticker_heartbeat`, epoch seconds) on startup and after every tick — best-effort, never raised into the loop. Both ticker entry points (the gateway and the desktop fallback in web_server.py) funnel through `InProcessCronScheduler.start`, so one heartbeat site covers both. - `hermes cron status` now reads the heartbeat age: if the gateway is running but the heartbeat is stale (> 200s, i.e. several missed ~60s ticks), it reports the ticker as STALLED and suggests a restart instead of falsely claiming jobs will fire. A missing heartbeat (older build / never ran) is treated as "unknown", not "dead". Adds tests for BaseException survival, per-iteration heartbeat recording, heartbeat round-trip/age, staleness detection, and silent-write-failure. Salvaged from #49660 (BaseException survival on current structure), extended with the heartbeat + honest-status reporting that the earlier (pre-refactor) watchdog PRs #35616 and #33849 proposed. Fixes #32612 Fixes #32895 Co-authored-by: banditburai <promptsiren@gmail.com> Co-authored-by: sweetcornna <96944678+sweetcornna@users.noreply.github.com>	2026-06-21 13:00:50 +05:30
Tortugasaur	c02648c5dd	fix(docs): align slash-command and docker docs	2026-06-20 23:23:47 -07:00
Kevin Anderson	b337afdf6e	docs(cli): fix broken terminal-backend guide link in setup wizard The terminal backend onboarding step pointed at /docs/developer-guide/environments, which no longer exists. Point it at the live docs page /docs/user-guide/configuration#terminal-backend-configuration. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-20 23:23:47 -07:00
EloquentBrush0x	9bd5003d4f	fix(spotify): quarantine dead tokens on terminal refresh failure resolve_spotify_runtime_credentials() called _refresh_spotify_oauth_state() without a try/except, so a terminal failure (HTTP 400/401, invalid_grant, refresh_token_reused) raised AuthError but left the dead refresh_token in auth.json. Every subsequent session re-read and retried the same token over the network, failing identically each time. Fix: wrap the refresh call and, when exc.relogin_required is True and a refresh_token is present, clear the dead OAuth fields (access_token, refresh_token, expires_at, expires_in, obtained_at) and write a last_auth_error quarantine marker to auth.json before re-raising. The next call sees no access_token and fails fast with spotify_access_token_missing — no network retry — and the user is prompted to re-authenticate. Mirrors the quarantine pattern already in place for Nous, xAI-OAuth, Codex-OAuth (#28116, #28118), and MiniMax-OAuth (#28119).	2026-06-20 23:23:47 -07:00
teknium1	7ace96ba40	fix(compression): preserve goal, platform, and session indexing across rotation Three state-loss bugs at the compression rotation boundary, fixed together because they all live in the same ~80-line rotation block: - #33618: a persistent /goal did not follow the rotation. load_goal does a flat per-session lookup with no lineage walk, so a goal silently died when compression minted a fresh child id. Added migrate_goal_to_session() and call it after the child session is created (move-not-copy: the parent row is archived as cleared so exactly one active goal row exists). - #33906/#33907: if the child create_session raised (FK constraint, contended write), the outer handler only warned and let the agent continue on the NEW id — which has no row in state.db — producing an orphan session. Now the rotation rolls agent.session_id back to the still-indexed parent (reopening it) instead of stranding the conversation on a phantom id. - #27633: the compaction-boundary on_session_start notification omitted the platform kwarg, so context-engine plugins saw source=unknown for every message after the boundary. Forward platform (matching the initial session-start call in agent_init.py). Co-authored-by: denisqq <21260182+denisqq@users.noreply.github.com> Co-authored-by: zccyman <16263913+zccyman@users.noreply.github.com> Co-authored-by: liuhao1024 <sunsky.lau@gmail.com>	2026-06-20 20:06:24 -07:00
Teknium	170ef24c8f	fix(doctor): audit WhatsApp bridge at its resolved (HERMES_HOME) dir (#49890) doctor's npm audit hardcoded PROJECT_ROOT/scripts/whatsapp-bridge. In read-only Docker installs the bridge deps live in the writable HERMES_HOME mirror (#49561), so node_modules was never found there and the bridge audit silently skipped. Resolve the dir through the shared resolve_whatsapp_bridge_dir() helper so doctor audits where deps actually install. Falls back to the install-tree path if the helper is unavailable.	2026-06-20 19:55:12 -07:00
teknium1	15cfc2836f	fix(kanban): anchor no-path worktree tasks on board default_workdir Follow-up to the salvaged worktree-materialization fix. When a worktree task has no explicit workspace_path, resolve the anchor from the board's default_workdir (a git repo) and materialize <repo>/.worktrees/<id> per task, instead of silently rooting under the dispatcher's CWD (whatever directory launched the gateway, e.g. the Hermes checkout). If no default_workdir is configured, raise with a clear message rather than guessing from CWD. Adds AUTHOR_MAP entry for the salvaged commit.	2026-06-20 19:12:23 -07:00
Ahmad Ashfaq	d79f67fda6	fix(kanban): materialize and reuse linked worktrees for worktree tasks The dispatcher treated workspace_kind=worktree as metadata only and never ran 'git worktree add', so every worktree task ran in the main repo checkout instead of an isolated worktree — concurrent tasks silently shared one tree and contaminated each other. This materializes a real linked worktree at <repo>/.worktrees/<task_id> on branch wt/<task_id> when resolve_workspace() handles a worktree task, treats a repo-root workspace_path as shorthand for that location, persists the derived workspace/branch back onto the task row, and — on rerun/redispatch — detects an already-materialized linked worktree (via git-common-dir) and reuses it instead of nesting a second .worktrees/<id> inside it.	2026-06-20 19:12:23 -07:00
Zheng Tao	491579fa05	fix(whatsapp): resolve bridge dir with HERMES_HOME mirror in Docker In Docker the install tree (/opt/hermes) is read-only, so npm install for the WhatsApp bridge fails with EACCES. Add resolve_whatsapp_bridge_dir() in whatsapp_common.py: when the install dir is read-only, mirror the bridge source into a writable HERMES_HOME location and use that. Both the adapter and the 'hermes whatsapp' CLI resolve through the shared helper so the install and runtime paths agree. Fixes #49561	2026-06-20 17:05:27 -07:00
kshitijk4poor	69716a2e6f	docs(compression): fix stale 'discarded' wording on in_place config flag Review nit (yoniebans): the config.py comment still said compaction is 'lossy: the pre-compaction transcript is discarded, matching Claude Code / Codex' — leftover from the original destructive design. The shipped behavior is soft-archive: lossy for the LIVE context (what the model reloads), but the pre-compaction turns are kept on disk (active=0, compacted=1), searchable via session_search and recoverable. Comment now says so. Comment-only; no behavior change.	2026-06-20 10:57:07 -07:00
kshitijk4poor	47fadc24d7	feat(compression): in-place compaction option that keeps one session id (#38763) Context compression today rewrites the message list AND rotates the session id — it ends the session, forks a parent_session_id child, and renumbers the title (name -> name #2). That moving identity key is the root cause of a whole bug cluster: /goal lost (#33618), pending response lost at the split (#14238), orphan sessions (#33907), TUI sid desync (#36777), FTS search gaps + duplicate sidebar entries (#45117), null continuation cwd (#42228), and title-rename dead-ends (#48989). It also forced a large defensive apparatus (compression lock, contextvar/env/logging triple-sync, orphan finalization, gateway SessionEntry re-propagation, tip projection) whose only job is surviving a mid-conversation id change. Add a compression.in_place config flag (default False during rollout). When True, compaction rewrites the transcript and rebuilds the system prompt but keeps the SAME session_id: no end_session, no child row, no title renumber, no contextvar/logging re-sync, no memory/context-engine session-switch. The conversation keeps one durable id for life, like Claude Code / Codex. Compaction is lossy by design — the pre-compaction transcript is summarized away, not archived. The rotation path is unchanged when the flag is off (moved verbatim into an else branch). Staged rollout: this PR ships the option behind a default-off flag for live validation; a follow-up flips the default and deletes the now-redundant rotation machinery, superseding the 14 open band-aid PRs in this area. - hermes_cli/config.py: add compression.in_place (default False), documented - agent/agent_init.py: resolve the flag -> agent.compression_in_place - agent/conversation_compression.py: branch compress_context() on the flag - tests/run_agent/test_in_place_compaction.py: in-place invariants + rotation regression guard + config default The pre-flush of current-turn messages (#47202) runs in BOTH modes, so no boundary data loss. Prompt-cache invariant preserved: the system-prompt rebuild is the same single sanctioned invalidation that already happens during compaction — no NEW invalidation. Message alternation preserved.	2026-06-20 10:57:07 -07:00
teknium1	37a4dd4982	fix(auth): heal poisoned Nous inference URL on refresh instead of retaining it A nous inference_base_url that fails the host allowlist (e.g. a stale stg-inference-api.nousresearch.com persisted before the allowlist existed) was only replaced 'if refreshed_url:' — so when the validator rejected the URL it left the poisoned value in place. The 'falling back to default' warning fired but never took effect: every subsequent call, including the auxiliary compression call, kept hitting the dead staging endpoint and 401'd. Reset to DEFAULT_NOUS_INFERENCE_URL when validation returns None at both refresh sites in resolve_nous_runtime_credentials, so a poisoned auth.json self-heals on the next refresh. The proxy adapter already did this correctly; this brings the two auth.py sites in line.	2026-06-20 10:53:45 -07:00
Teknium	11c6f4c7bc	feat(setup): Blank Slate setup mode — minimal agent, opt in to everything (#36733) * feat(setup): Blank Slate setup mode — minimal agent, opt in to everything Adds a third first-time setup option alongside Quick Setup and Full Setup. Blank Slate forces ON only what an agent needs to run — provider & model, the File Operations toolset, and the Terminal toolset — and turns everything else OFF, then walks the user through opting each capability back in. What it does: - platform_toolsets.cli = [file, terminal] (explicit, authoritative list) - agent.disabled_toolsets = every other known toolset (web, browser, code_execution, vision, memory, delegation, cronjob, skills, image_gen, kanban, …). Applied last in the resolver, so it overrides the non-configurable platform-toolset recovery that would otherwise re-add toolsets like kanban — guaranteeing a true blank slate. - Optional config features off: compression, memory + user-profile capture, checkpoints, smart model routing, auto session reset. - Bundled skills default to NONE (reuses the .no-bundled-skills marker); offers to seed the full catalog. - Walks through tools / plugins / MCP / messaging, all opt-in. Proven end-to-end: with the Blank Slate config, model_tools.get_tool_definitions emits exactly 6 schemas — patch, process, read_file, search_files, terminal, write_file. Nothing else reaches the model. Re-enable later via hermes tools / hermes skills opt-in --sync / hermes setup agent. Tests: tests/hermes_cli/test_setup_blank_slate.py (8 tests) pin the writers, the resolver invariant ({file, terminal}), and the 6-schema end-to-end set. Docs: getting-started/quickstart.md documents all three setup modes. * feat(setup): Blank Slate fork — finish minimal, or walk through configs After applying the minimal baseline (provider/model + file + terminal, everything else off), Blank Slate now presents a choice instead of always running the full walkthrough: 1. Start with everything disabled — finish now with the minimal agent. 2. Walk through all configurations — opt in to tools, skills, plugins, MCP, and messaging. Provider/model and terminal are still configured first either way (the agent can't run without them). The finish-now path records the bundled-skill opt-out so future `hermes update` runs don't re-inject skills. The walkthrough body moved to a separate _blank_slate_walkthrough() helper. Tests: TestBlankSlateFork covers both branches (finish-now applies baseline + skill opt-out and skips the walkthrough; walkthrough path invokes it). Docs updated to describe the fork.	2026-06-20 10:45:55 -07:00
Teknium	5600105478	refactor(gateway): migrate slack/dingtalk/whatsapp/matrix/feishu/telegram/wecom/email/sms adapters to bundled plugins Salvage of PR #41284 onto current main. Relocates the last 9 inline messaging adapters (+ satellites: telegram_network, feishu_comment/_rules/meeting_invite, wecom_crypto, wecom_callback) from gateway/platforms/ into self-contained bundled plugins under plugins/platforms/<x>/, discovered via the platform registry. Strips the per-platform core touchpoints from gateway/run.py, gateway/config.py, hermes_cli/gateway.py, hermes_cli/setup.py, and tools/send_message_tool.py. Carries forward the migration fixes (explicit enabled:false honored, get_connected_platforms forces discovery, plugin is_connected via gateway.get_env_value, logs --component gateway matches plugins.platforms.*, matrix hidden on Windows). Additionally ports config keys main added since the PR base: the matrix plugin's _apply_yaml_config now also covers allowed_users, ignore_user_patterns, process_notices, and session_scope (the inline gateway/config.py matrix block gained these in the 1340 commits the PR sat open; they would otherwise have been silently dropped on deletion).	2026-06-20 10:26:45 -07:00
kshitijk4poor	a7dd98c860	fix(env): guard remaining malformed int/float env var casts with utils helpers Widen the env_float() guard from #48735 across the whole bug class: a non-numeric value (e.g. a stale .env "HERMES_API_TIMEOUT=abc" or a typo'd port) raised an unhandled ValueError and crashed adapter/agent init. Converts 22 genuinely-unguarded first-party int/float(os.getenv()) sites to the canonical utils.env_int / utils.env_float helpers (the established house pattern), instead of duplicating per-module helpers or inline try/except: - gateway/config.py: WECOM_CALLBACK_PORT, BLUEBUBBLES_WEBHOOK_PORT - gateway/platforms/email.py: EMAIL_IMAP/SMTP_PORT, EMAIL_POLL_INTERVAL - gateway/platforms/feishu.py: dedup cache + text/media batch settings - gateway/platforms/wecom.py, discord/adapter.py: text batch delays - gateway/platforms/telegram.py: media batch delay, TELEGRAM_WEBHOOK_PORT - gateway/platforms/whatsapp.py: WHATSAPP_NPM_INSTALL_TIMEOUT - hermes_cli/auth.py: CODEX/XAI refresh timeouts - agent/chat_completion_helpers.py: API/stream read/stale timeouts - run_agent.py, agent/auxiliary_client.py: API + nous timeouts Sites already guarded by try/except or local helpers are left untouched. The HERMES_MAX_ITERATIONS sites are already guarded on main via _current_max_iterations(), so they are not included.	2026-06-20 14:54:36 +05:30
helix4u	c253b07380	fix(model): clear stale endpoint credentials across switches	2026-06-19 19:58:26 -07:00
helix4u	95a3affc2e	fix(model): keep Nous picker from restoring stale custom keys	2026-06-19 19:58:26 -07:00
Teknium	cf58f1a520	feat(titles): support language-aware title generation (#45296) Make auxiliary title prompts match the user language by default, with an optional pinned `auxiliary.title_generation.language` config.	2026-06-19 17:15:52 -07:00
teknium1	64b21e50fb	fix(cli): publish agent ref to cli module so memory on_session_end fires on exit The god-file Phase 4 refactor (`094aa85c37`) moved agent construction into CLIAgentSetupMixin, which set the atexit shutdown reference with a bare `global _active_agent_ref`. After extraction that global binds the mixin module's namespace, not cli.py's. cli._run_cleanup reads cli._active_agent_ref to decide whether to fire the memory provider's on_session_end hook — and it stayed None for the whole session, so the `if _active_agent_ref:` branch was dead and on_session_end never ran on /exit. Custom memory providers silently lost end-of-session extraction. Fix: publish the reference onto the cli module explicitly (`import cli as _cli; _cli._active_agent_ref = self.agent`), using the deferred-import pattern already established in the mixin. Regression test asserts cli._active_agent_ref is populated by the mixin's publish line and guards against a relapse to the bare `global` form. The existing shutdown tests passed only because they hand-assigned the ref, which is exactly what masked this.	2026-06-19 16:59:43 -07:00
kshitijk4poor	d4e7dd609d	refactor(windows): tidy managed-node resolver helpers Behavior-preserving cleanups on the managed-node resolver: - Hoist _candidate_node_command_names() out of the inner dir loop in find_hermes_node_executable (computed once, not per directory). - Drop redundant os.environ.copy() at the two with_hermes_node_path(os.environ.copy()) sites — the helper already copies os.environ when called with no argument (verified env-equivalent). - Add reciprocal keep-in-sync comments between iter_hermes_node_dirs() (hermes_constants.py) and hermesManagedNodePathEntries() (electron main.cjs), which mirror the same platform-ordering rule across the Python/Node boundary.	2026-06-20 02:12:16 +05:30
kshitijk4poor	fcc169057d	fix(windows): prefer managed npm for hermes update desktop-rebuild gate The `hermes update` desktop-rebuild gate still used a bare `shutil.which("npm")` presence check. On a Windows box where the only working npm is the Hermes-managed npm.cmd (not on PATH), the gate would skip the desktop rebuild even though _build_web_ui / cmd_gui can now find it via find_node_executable. Route the gate through the same resolver for full bug-class coverage. Surfaced during review of #49239.	2026-06-20 02:01:24 +05:30
helix4u	7a7b56d498	fix(windows): prefer managed node for whatsapp and desktop	2026-06-20 02:00:37 +05:30
teknium1	2bd1977d8f	chore: release v0.17.0 (2026.6.19)	2026-06-19 12:38:31 -07:00
alt-glitch	88d523220f	fix(mcp): address adversarial review round 2 (stale-publish race, parity holes) Second review pass (Codex + Hermes subagent). Codex reproduced a real race with a two-thread harness; both converged on the remaining issues. - Generation-aware publish (fixes a lost-update race): two refresh callers (the late-refresh daemon and the between-turns prologue around turn 1) could each compute a snapshot outside the lock; a SLOWER caller holding an OLDER registry generation could acquire the publish lock after a newer caller and clobber it, deleting just-landed tools. refresh_agent_mcp_tools now captures registry._generation before computing and refuses to publish a stale set; agent._tool_snapshot_generation tracks the published generation. - Context-engine routing names (_context_engine_tool_names) are now staged on a local and published atomically with the snapshot, and only claimed when this rebuild actually appended the schema — matching agent_init's dedup so a registry/plugin tool of the same name keeps its own dispatch. (Previously mutated live, before the publish lock, and on no-change refreshes.) - CLI /reload-mcp: self.enabled_toolsets is resolved once at startup, so a server newly ENABLED in config mid-session wasn't picked up (TUI already re-resolved). Merge now-connected MCP server names into the override (unless the user pinned all/*), mirroring startup, and keep self.enabled_toolsets in sync. Closes the CLI/TUI parity hole. - ACP (acp_adapter/server.py) routed through the shared helper — it was a 5th sibling rebuild that re-injected memory tools but NOT context-engine tools and bypassed the atomic/name-diff path (inert today, fragile). - mcp_startup._resolve_discovery_timeout pulls its default from DEFAULT_CONFIG (single source of truth) instead of a stale hardcoded 5.0 literal. - Tests: stale-generation-no-clobber, _skip_mcp_refresh honored, timeout fallback uses DEFAULT_CONFIG.	2026-06-19 11:57:43 -07:00
alt-glitch	b6e2a54a94	fix(mcp): address adversarial review round 1 (cache parity, gates, races) Consolidated findings from three independent reviewers (Codex, Claude Code, a Hermes subagent w/ the hermes-agent-dev skill): - BLOCKING: refresh_agent_mcp_tools rebuilt only the registry subset, silently dropping post-build-injected memory-provider (mem0/honcho/…) and context-engine (lcm_) tools on every refresh. Now additive-preserving: re-applies the same injectors agent_init uses, staged on locals and published atomically. - Re-injection now honors the #5544 enabled_toolsets gate for context-engine tools, so a restricted-toolset platform can't get lcm_ leaked back in. - Atomic read-diff-publish under one lock: the returned `added` set and the (tools, valid_tool_names) pair are consistent even under concurrent callers (no half-swap, no TOCTOU). - background_review fork opts out (_skip_mcp_refresh) so its byte-identical tools[] cache parity with the parent is preserved. - CLI /reload-mcp routed through the shared helper (was a 4th divergent copy with the same clobber bug + missing disabled_toolsets). - Explicit reloads (TUI RPC + CLI) pass enabled_override so a server the user just enabled in config this session is picked up; automatic paths reuse the agent's build-time selection. - mcp_discovery_timeout default 5.0 -> 1.5s: correctness now comes from the between-turns refresh, so the startup wait is only a small turn-1 UX bump rather than a heavy dead-server latency penalty. - has_registered_mcp_tools checks registered TOOLS (not connected servers) so a zero-tool/prompt-only server doesn't make the per-turn hook fire forever. - Tests: rewrote the thread-safety test to actually exercise the write path (alternating tool sets), added the #5544-gate regression, the memory/context preservation regression, and a "callable next turn via valid_tool_names" contract; removed a dead monkeypatch line.	2026-06-19 11:57:43 -07:00
alt-glitch	93d6e73028	fix(mcp): expose late-connecting MCP tools to the agent (TUI/CLI/gateway) MCP servers that connect after the agent's one-time tool snapshot were invisible for the whole session. Two root causes, fixed together: 1. The startup discovery wait was a flat 0.75s. HTTP/OAuth servers commonly take 2-6s on a cold connect, so they missed the window and their tools never entered the agent's snapshot. `thread.join(timeout)` already returns the instant discovery completes, so raising the bound costs ~0s for the common case (no MCP / fast servers) and only ever blocks for a genuinely-pending server, capped so a dead server can't freeze startup. The bound is now configurable via `mcp_discovery_timeout` (config.yaml, default 5.0s). 2. Three call sites duplicated the agent tool-snapshot rebuild (the TUI `reload.mcp` RPC, the gateway reload, and the TUI late-binding refresh thread), and the late-refresh detected changes by tool COUNT — missing an equal-size add/remove swap. Consolidated into one shared `tools.mcp_tool.refresh_agent_mcp_tools(agent)` helper that diffs by tool NAME, mutates the agent under a lock (thread-safe), and respects the agent's own enabled/disabled toolsets. The late-binding refresh keeps its pre-first-turn cache-safety guard: it never rebuilds the tool list once a turn has started, so the cached prompt prefix is never invalidated mid-conversation. Tests: new tests/tools/test_refresh_agent_mcp_tools.py covers the name-based diff, in-place mutation, agent-scoped filtering, thread safety, and the config-driven discovery bound (incl. instant-return when nothing is pending). 75 passed across the touched areas.	2026-06-19 11:57:43 -07:00
Teknium	2a5e9d994a	Merge pull request #48275 from NousResearch/feat/cron-scheduler-provider-chronos feat(cron): pluggable CronScheduler interface + Chronos managed-cron provider (scale-to-zero)	2026-06-19 07:51:59 -07:00
Ben	1928aa0443	fix(managed-scope): honor managed scope in config→env bridges too Manual verification surfaced a second bypass class beyond the standalone config loaders: several code paths bridge config.yaml values into os.environ (HERMES_TIMEZONE, HERMES_REDACT_SECRETS, HERMES_MAX_ITERATIONS, TERMINAL_*, network.force_ipv4, ...) by reading the raw user YAML, so the env the whole process reads carried the USER's value even when an administrator pinned it — e.g. a managed timezone was overridden because gateway/run.py wrote the user's timezone into HERMES_TIMEZONE, and _resolve_timezone_name() checks the env var first. Wired the shared apply_managed_overlay() into every config→env bridge: - gateway/run.py module-level startup bridge (timezone, redact_secrets, max_turns, terminal, display, gateway.strict, ...) - gateway/run.py _reload_runtime_env_preserving_config_authority (the per-turn re-bridge that keeps config authoritative over reloaded .env — must keep MANAGED authoritative on every turn, not just startup) - hermes_cli/main.py early security.redact_secrets / network.force_ipv4 bridge (runs before load_config is usable, at import time) - hermes_cli/send_cmd.py top-level scalar config→env bridge Verified end-to-end against a writable managed dir (12/12 checks incl. timezone, logging, model, skin, gateway settings, write-guard) and in a clean process the gateway per-turn bridge writes HERMES_TIMEZONE=<managed>. Adds an order-independent regression test for the bridge overlay.	2026-06-19 07:46:33 -07:00
Ben	b0e47a98f9	fix(managed-scope): honor managed scope in all standalone config loaders The skin bug was one instance of a class: several subsystems build their config dict directly from config.yaml instead of routing through hermes_cli.config.load_config (which carries the managed merge), so they silently ignored administrator-pinned values. Audited every config.yaml reader and fixed the behavioral-read bypasses: - gateway/config.py load_gateway_config (messaging gateway: session_reset, quick_commands, stt, model, ...) - gateway/run.py _load_gateway_config (its read_raw_config fast path also skipped the merge — read_raw_config returns raw user YAML) - tui_gateway/server.py _load_cfg (new TUI + desktop backend: skin, reasoning_effort, service_tier, provider_routing) - cron/scheduler.py (scheduled-job model/reasoning/toolsets/provider_routing) - hermes_logging.py (logging.level/max_size_mb/backup_count) - hermes_time.py (timezone) - hermes_cli/doctor.py (memory-provider diagnostic reads effective config) All route through a new shared managed_scope.apply_managed_overlay() helper that mirrors _load_config_impl (env-only expansion so a user ${VAR} can't shadow a managed literal, root-model-string normalization, leaf-merge) and is fail-open. cli.py's earlier inline fix is refactored onto the same helper. Write-back paths (slash_commands, telegram/yuanbao dm_topics, profile distribution) are deliberately left reading raw user YAML — overlaying managed values there would persist them into the user file. The dashboard (web_server.py) already routes through load_config and needed no change. TUI loader caches the RAW config so _save_cfg never writes managed values to disk. Adds test_managed_scope_overlay.py (helper) and test_managed_scope_loaders.py (per-surface integration); mutation-checked.	2026-06-19 07:46:33 -07:00
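
The leaf-merge and fail-open behavior described for the managed-scope overlay in the last entry can be illustrated with a small sketch. This is not the repository's apply_managed_overlay(); the names leaf_merge and apply_overlay_sketch are hypothetical, and the sketch leaves out the env-only expansion and root-model-string normalization the commit mentions. It only shows the idea that a managed leaf wins over the user's value while unpinned user leaves survive.

```python
from copy import deepcopy
from typing import Any, Mapping


def leaf_merge(user: Mapping[str, Any], managed: Mapping[str, Any]) -> dict:
    """Return a copy of `user` in which every leaf present in `managed` wins.

    Hypothetical helper: nested mappings are merged key by key, so a managed
    `logging.level` does not wipe out an unrelated user `logging.backup_count`.
    """
    merged = deepcopy(dict(user))
    for key, value in managed.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), dict):
            # Both sides are mappings: descend and merge leaf by leaf.
            merged[key] = leaf_merge(merged[key], value)
        else:
            # Leaf (or type mismatch): the managed value replaces the user's.
            merged[key] = deepcopy(value)
    return merged


def apply_overlay_sketch(user_cfg: Mapping[str, Any], managed_cfg: Any) -> dict:
    """Fail-open wrapper (an assumption about what 'fail-open' means here):
    if the managed layer is missing or malformed, return the user config
    unchanged instead of raising."""
    try:
        return leaf_merge(user_cfg, managed_cfg or {})
    except Exception:
        return deepcopy(dict(user_cfg))


user = {"timezone": "UTC", "logging": {"level": "DEBUG", "backup_count": 3}}
managed = {"timezone": "America/New_York", "logging": {"level": "INFO"}}
effective = apply_overlay_sketch(user, managed)
```

In this sketch the inputs are never mutated, which matches the commit's concern that write-back paths must not persist managed values into the user file: a caller can keep the raw user dict for saving and use the merged dict only for reads.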


2921 commits