hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-06-23 10:42:00 +00:00

Author	SHA1	Message	Date
Teknium	5ff11a689b	feat(cli): /timestamps command + timestamps in /history (#50506) display.timestamps already drove the [HH:MM] suffix on live submitted and streamed message labels, but there was no runtime command to toggle it and /history ignored the setting entirely. Add /timestamps [on|off|status] (alias /ts) and render [HH:MM] in /history for turns that carry a stored unix timestamp (resumed sessions). Live unsaved turns without a stored time are never given a fabricated one. Uses the existing sanctioned non-wire 'timestamp' message key (stripped before the API call in chat_completions), so message-alternation and prompt-cache invariants are untouched.	2026-06-21 22:44:25 -07:00
Shannon Sands	5dae502b86	Address email pairing review feedback	2026-06-21 22:43:57 -07:00
Shannon Sands	2455e1801b	Make email pairing opt-in	2026-06-21 22:43:57 -07:00
Shannon Sands	4b09903de5	fix Nous auth refresh for idle agents	2026-06-21 22:43:48 -07:00
teknium1	4314d451ca	fix(gateway): accept any inbound file type across all messaging platforms Authorization to message the agent is the gate, not the file extension. Previously the inbound-attachment allowlist (SUPPORTED_DOCUMENT_TYPES) was opt-OUT on Discord (allow_any_attachment defaulted false) and had no bypass at all on Telegram/Slack — so an .html (or any non-allowlisted type) was dropped or hard-rejected before the agent saw it. Now every authorized upload is cached and surfaced to the agent regardless of type: - base.cache_media_bytes(): unknown types cache as octet-stream (or the caller-supplied MIME) instead of returning None — fixes the chokepoint that Teams/Telegram-media route through. - discord/telegram/slack adapters: removed the allowlist reject/skip; any non-media attachment is typed DOCUMENT and cached. Known types keep their precise MIME. - Text inlining now gates on a shared _TEXT_INJECT_EXTENSIONS set (text + code + config + markup) instead of a blind UTF-8 decode, so binary formats (PDF/zip/docx) with ASCII headers are never inlined. - gateway/run.py emits the path-pointing context note for every DOCUMENT, including non text/application MIME types. - discord.allow_any_attachment is now a documented no-op kept for config back-compat. Validation: 357 gateway tests pass; E2E confirms .html/.bin/custom types cache, known types stay precise, PDFs are not inlined.	2026-06-21 22:43:45 -07:00
Ben Barclay	6202fdfc35	fix(container): detect dashboard role under s6-overlay v3 (#49196 ) (#50600 ) * fix(gateway): walk /proc//cmdline to find main-wrapper.sh under s6-overlay v3 (#49196) (cherry picked from commit `3a108c2df0`) fix(container): peel s6-v3 rc.init prefix so dashboard role is detected kyssta-exe's preceding commit (#49238) fixed _read_container_argv() to locate the rc.init-launched main-wrapper.sh process under s6-overlay v3, but the skip still never fired: _strip_container_argv_prefix() only peeled a prefix when args[0] was init/main-wrapper.sh/hermes. Under s6 v3 the matched argv is /bin/sh -e /run/s6/basedir/scripts/rc.init top /opt/hermes/docker/main-wrapper.sh dashboard ... so args[0] stayed /bin/sh, _is_dashboard_container() returned False, and the dashboard container reconciled + started its own gateway-default — the exact dual Telegram getUpdates 409 in issue #49196. Fix: strip everything up to and including the main-wrapper.sh token (the stable boundary the image owns), covering both the v2 (/init ...) and v3 (/bin/sh ... rc.init top ...) shapes with one rule, instead of matching launcher tokens positionally. This also repairs _is_legacy_gateway_run_request() under v3, which shares the same strip helper (the issue called this out). Tests: extend the dashboard true/false parametrize sets with the s6-v3 argv shape, and add test_main_skips_reconcile_in_dashboard_container_s6v3 exercising main() end-to-end with the v3 argv. Verified via mutation that both new v3 assertions fail under the old positional strip and pass with the fix. --------- Co-authored-by: kyssta-exe <kyssta-exe@users.noreply.github.com>	2026-06-22 15:35:38 +10:00
Teknium	e448b21414	feat(dashboard): interactive auth setup on no-provider non-loopback bind (#50551) When `hermes dashboard --host 0.0.0.0` is run interactively with the auth gate engaged but no DashboardAuthProvider configured, prompt to set up the bundled username/password provider on the spot (or point at `hermes dashboard register` for OAuth) instead of only emitting the fail-closed error. - main.py: `_maybe_setup_dashboard_auth_interactively()` runs before start_server. No-ops on loopback binds, when a provider is already registered, or when stdin/stdout isn't a TTY (Docker/s6, CI, piped runs) so the fail-closed SystemExit stays the backstop for unattended deploys. On the password path it writes dashboard.basic_auth.{username,password_hash,secret} to config.yaml (scrypt hash, never plaintext), then force-rediscovers plugins so the basic provider registers before the gate check. - web_server.py: fix the fail-closed hint — it told operators to set `dashboard_auth.basic.username` but the provider reads `dashboard.basic_auth`. - docs: note the interactive setup under Fail-closed semantics. No new env vars; reuses the existing dashboard.basic_auth config surface.	2026-06-21 20:21:48 -07:00
Teknium	9e96e70995	feat(cli): /prompt — compose your next prompt in $EDITOR (#50509) * feat(cli): /prompt — compose your next prompt in $EDITOR Adds /prompt (alias /compose): opens $VISUAL/$EDITOR on a temp markdown file so you can hand-edit a multi-line prompt, then sends the saved buffer as the next agent turn. Text after the command pre-seeds the buffer; an empty save cancels. Reuses the one-shot _pending_agent_seed the interactive loop already consumes (same mechanism as /blueprint), so no changes to the input event loop or message pipeline. CLI-only. * feat(tui): /prompt slash command opens $EDITOR (parity with CLI) The TUI already opens $EDITOR via Ctrl+G (openEditor), but had no /prompt slash command like the classic CLI. Wire openEditor into the slash handler context and register /prompt (alias /compose) to call it; inline text after the command is dropped into the composer first so it carries into the editor, matching the CLI's /prompt <text>.	2026-06-21 20:21:33 -07:00
Teknium	95d53c3bcb	feat(cli): /reasoning full — show complete thinking, not 10-line clamp (#50499) * feat(cli): /reasoning full to show complete thinking, not 10-line clamp The post-response Reasoning recap box hard-clamped long thinking to the first 10 lines, so there was no way to see the full reasoning trace after a turn (live streaming already shows it in full). Add display.reasoning_full (default off) plus /reasoning full|clamp to toggle it at runtime; the clamp truncation note now points at the command. Addresses repeated user requests to show all thinking tokens. * test(gateway): de-snapshot /reasoning help assertion The test froze the exact args-hint literal '/reasoning [level|show|hide]', which the new full/clamp args change to '[level|show|hide|full|clamp]'. Convert to an invariant: assert /reasoning is in help and carries its core args, not the exact hint string. * feat(tui): /reasoning full|clamp parity in tui_gateway The classic-CLI reasoning_full toggle had no TUI equivalent — typing /reasoning full in the TUI fell through to parse_reasoning_effort and errored. The TUI renders thinking as an expand/collapse section (no fixed 10-line recap), so map full -> sections.thinking=expanded (raw, uncapped via thinkingPreview mode='full') and clamp -> collapsed, persisting display.reasoning_full for cross-surface config consistency.	2026-06-21 20:21:11 -07:00
Teknium	7130d60861	feat(providers): remove google-gemini-cli + google-antigravity OAuth providers (#50492 ) * feat(providers): remove google-gemini-cli + google-antigravity OAuth providers Google now actively bans accounts for third-party tools that piggyback on Gemini CLI / Antigravity / Code Assist OAuth, and because abuse prevention sits at a backend layer the ban can extend to the entire Google account (Gmail/Drive), with a second violation being permanent. Ref: https://github.com/google-gemini/gemini-cli/discussions/20632 Removes both OAuth inference providers entirely (modules, provider profiles, auth/runtime/config/models wiring, the /gquota Code Assist quota command, the antigravity-cli optional skill, desktop + docs surface in en + zh-Hans). The API-key 'gemini' provider (GOOGLE_API_KEY/GEMINI_API_KEY against generativelanguage.googleapis.com) is unaffected and stays fully supported. * fix(skills): keep the antigravity-cli skill — only the OAuth provider is removed The antigravity-cli optional skill orchestrates the external `agy` binary as a coding-agent tool via the terminal tool — it does NOT wrap Hermes inference through the banned google-antigravity OAuth provider, so it carries none of the account-ban risk that motivated removing that provider. Restore the skill, its docs page, the sidebar entry, and the optional-skills catalog row. The google-antigravity / google-gemini-cli inference providers stay fully removed.	2026-06-21 19:53:27 -07:00
Teknium	5bf23ff251	fix(banner): don't advertise toolsets/skills the agent wasn't given (#50497 ) The welcome banner's 'Available Tools' merged in every toolset from the global check_tool_availability() registry walk, regardless of whether it was enabled for the current platform. On a Blank Slate CLI (file + terminal only) that surfaced discord / feishu / kanban tools the agent was never actually given — they are not in the agent's tool schema, but the banner displayed them, making it look like they were exposed. - Filter the unavailable-toolset merge to toolsets actually in enabled_toolsets (a toolset that's enabled but has unmet deps still legitimately shows as disabled/lazy). - Gate the 'Available Skills' section on the skills toolset being enabled — when it's off, the agent can't load any skill, so show 'Skills toolset disabled' instead of the on-disk catalog. When enabled_toolsets is empty (older callers), behavior is unchanged. Validation: blank-slate banner now shows only file + terminal and 'Skills toolset disabled'; a skills-enabled banner still lists the catalog. Added regression tests; full banner suite green (15/15).	2026-06-21 19:08:54 -07:00
teknium1	8cecaf0b29	feat(process): escalate SIGTERM->SIGKILL on host-pid termination after grace A daemon that ignores or stalls in its SIGTERM handler currently survives the process-registry reap and leaks until reboot (observed as agent-browser daemons accumulating to EMFILE on long-running gateways). _terminate_host_pid now snapshots the tree, SIGTERMs it, waits a bounded grace window (terminal.daemon_term_grace_seconds, default 2.0s, 0 disables), then SIGKILLs any survivor. The recycled-PID identity guard still gates the whole path, so escalation never reaches a stranger; Windows is unchanged (taskkill /F is already a hard kill). Config lives in config.yaml (terminal.daemon_term_grace_seconds), NOT an env var, per the .env-secrets-only policy. Implements the SIGKILL-escalation idea from @tkwong's #15008, reworked onto the current _terminate_host_pid tree-kill path (the original predated it) and config-gated instead of env-var-gated. Co-authored-by: Benjamin Wong <tkwong@inspiresynergy.com>	2026-06-21 19:08:52 -07:00
teknium1	41fe086eb6	style(security-audit): add explicit encoding to read_text calls (ruff PLW1514)	2026-06-21 19:05:27 -07:00
teknium1	f45ace9318	feat(security): startup security posture audit (warn-on-load) Surface dangerous host/deployment posture at gateway startup so operators get the 'you're exposed' signal the June 2026 MCP-config persistence campaign victims never had. Warn-only — never blocks startup, never raises. Checks (each independently fail-safe): - Running as root (POSIX uid 0) - SSH daemon with PasswordAuthentication enabled (incl. the 'yes' default) - Running in a container with no persistent volume mount over HERMES_HOME - Network-accessible API server with no API_SERVER_KEY New module hermes_cli/security_audit_startup.py; invoked once per process from start_gateway() right after setup_logging(). Cross-platform (root/SSH checks no-op on Windows). Idea: @Cthulhu.	2026-06-21 19:05:27 -07:00
teknium1	7726ce3040	fix(security): close hermes-0day MCP-persistence attack surface Remove the dashboard --insecure auth-bypass, add an MCP persistence guard + IOC blocklist, and raise the API-server key entropy floor. Driven by the June 2026 hermes-0day campaign (r/hermesagent, live 854.media instance): scanners find exposed Hermes dashboards/API servers, drive the root agent to plant a 'command: bash' MCP entry that appends an attacker SSH key to authorized_keys, which cron + startup then re-execute every tick. - dashboard: --insecure no longer disables the auth gate. should_require_auth returns True for every non-loopback bind; a public bind ALWAYS requires an auth provider (bundled password provider or OAuth). --insecure kept as a warned no-op for backward compat. Fail-closed error now points at the password provider, not at --insecure. - mcp_security: validate_mcp_server_entry now also rejects shell payloads that write to OS persistence surfaces (authorized_keys/.ssh/pam.d/sudoers/cron/ rc files) and hard-rejects a hermes-0day IOC blocklist (attacker SSH key + source IPs) anywhere in command/args/env. Runs at save AND spawn time. - api_server: raise network-bind API_SERVER_KEY entropy floor 8->16 chars; warn when a network-accessible API server runs an unsandboxed local backend.	2026-06-21 19:05:27 -07:00
Teknium	84e1d31e54	refactor(kanban): fold worker/orchestrator skills into injected guidance (#50473 ) The kanban-worker and kanban-orchestrator bundled skills existed only to be force-loaded into dispatcher-spawned workers, gated by environments:[kanban] so they wouldn't leak into normal CLI listings. That gating was fragile (the leak that #50443 patched) and the --skills auto-load was already best-effort — most workers ran without it because the bundled skill isn't present in profile-scoped skills dirs. Remove the skills entirely and promote their load-bearing content (workspace kinds, deliverable artifacts, created-card integrity, profile discovery) into KANBAN_GUIDANCE, which is already injected into every kanban worker's system prompt. Net result: every worker reliably gets the guidance, nothing can leak into a CLI/blank-slate session, and the gating machinery is gone. - agent/prompt_builder.py: promote the 4 load-bearing rules into KANBAN_GUIDANCE - hermes_cli/kanban_db.py: drop --skills kanban-worker auto-injection + _kanban_worker_skill_available probe - hermes_cli/kanban_swarm.py: drop skills=[kanban-orchestrator] on the root card - hermes_cli/kanban.py: drop kanban-init skill seeding; fix help text - delete skills/devops/kanban-{worker,orchestrator} - docs: delete the two skill pages (EN+zh), fix sidebars/catalog/kanban.md/kanban-worker-lanes.md and the video-orchestrator + codex-lane references - tests: update spawn-argv expectations; re-bound the guidance-size guard Supersedes the skill-leak half of #50443 (credit @helix4u for flagging the area).	2026-06-21 17:06:48 -07:00
Teknium	c768c4b71c	fix(antigravity): move model flow to model_setup_flows + stop bare-alias hijack CI on the salvage caught two issues the stale PR base masked: 1. The model-setup flows were extracted from main.py into hermes_cli/model_setup_flows.py after @pmos69 forked. The cherry-pick re-introduced a stale _model_flow_custom into main.py (duplicating the one main.py now imports) and put _model_flow_google_antigravity there too. Move the antigravity flow into model_setup_flows.py alongside its siblings and drop the stale _model_flow_custom dup. Fixes the getpass/stdin OSError in tests/cli/test_cli_provider_resolution.py. 2. google-antigravity re-exposes Claude/Gemini/GPT-OSS models, so its catalog was hijacking bare short aliases (`sonnet` -> google-antigravity instead of anthropic) in detect_static_provider_for_model via dict insertion order. Add _BORROWED_MODEL_PROVIDERS and defer those providers to a last-resort pass so a model's native vendor always wins alias/direct-catalog detection. Fixes tests/hermes_cli/test_models.py::test_short_alias_resolves_to_static_model.	2026-06-21 16:41:30 -07:00
pmos69	8baa4e9976	feat(cli): add native Antigravity OAuth provider	2026-06-21 16:41:30 -07:00
Teknium	824c9d3812	fix(config): alias model.api_base -> model.base_url for custom providers (#50385) A bare custom provider configured via `model.api_base` (the intuitive name OpenAI-SDK / LiteLLM users reach for) was silently ignored: `hermes config set` accepts any dotted key, so `model.api_base` got written and confirmed, but the runtime resolver reads only `model.base_url`. Requests fell back to OpenRouter with an empty key -> 401, zero hits to the custom endpoint (issue #8919). Now api_base is migrated to base_url at load time (fixes existing broken configs) and at set time (with a notice), never overriding an explicit base_url. Closes #8919.	2026-06-21 13:33:41 -07:00
Teknium	bb77a8b0d5	fix(gateway): respawn unmapped Windows gateways after update (#50090 ) (#50373 ) On Windows, _pause_windows_gateways_for_update() force-kills every running gateway before mutating the venv. Gateways mapped to a profile (via profile.path/gateway.pid) were respawned afterward, but gateways with NO profile mapping — e.g. a Windows Scheduled Task running "pythonw.exe -m hermes_cli.main gateway run" — were force-killed and only told to restart manually. After an auto-update/bootstrap the Telegram bot stayed dead until manual intervention. Now we snapshot each unmapped gateway's argv (psutil, guarded by looks_like_gateway_command_line) before the kill and replay it through the same detached watcher used for profile gateways, so unmapped gateways come back automatically too. Co-authored-by: Hermes Agent <agent@nousresearch.com>	2026-06-21 13:33:26 -07:00
memosr	ed3d12a762	fix(security): fail-closed when WebSocket peer is empty in loopback mode Per @egilewski's audit on this PR (#15544), the original fix was correct but the file has refactored since: the four endpoint-local empty-peer checks have been consolidated into _ws_client_is_allowed and _ws_client_reason, but the helpers were left fail-open ('no peer host known means allow' / 'no reason to block'). On a loopback-bound dashboard with auth disabled, an ASGI server behind a misconfigured proxy or a unix-socket transport can deliver ws.client == None or ws.client.host == ''. The helpers were treating that as 'allowed', so the loopback-only peer gate could be bypassed by anything that suppressed the client tuple in transit. All four WebSocket endpoints (/api/pty, /api/ws, /api/pub, /api/events) route through _ws_request_is_allowed -> _ws_client_is_allowed, so the gap applied uniformly. Fix: * _ws_client_is_allowed: return False when client_host is empty instead of True. Only reached on loopback bind with auth disabled (auth_required=True and explicit non-loopback binds short-circuit earlier), so the fail-closed behavior is scoped to the surface that needs it. * _ws_client_reason: return a 'missing_or_empty_peer bound=...' block reason instead of None, so the dispatcher's existing reason-based rejection path picks it up and the close gets logged with a machine-parseable token for diagnosability. Behavior unchanged for: * gated mode (auth_required=True) — early-returns True before the empty-peer check runs. The OAuth ticket is the auth at that point. * explicit non-loopback bind (--host 0.0.0.0/::, or a specific LAN address, always with --insecure) — early-returns True before the empty-peer check runs. DNS-rebinding is still blocked by the Host/Origin guard in _ws_host_origin_is_allowed. * legitimate loopback peers (client_host == '127.0.0.1' / '::1') — not affected by the empty-peer branch. Regression tests added in tests/hermes_cli/test_dashboard_auth_ws_auth.py: * test_empty_client_host_rejected_in_loopback_mode * test_missing_client_object_rejected_in_loopback_mode * test_empty_client_host_reason_is_block Plus two regression guards to ensure the fix does not over-reach: * test_empty_client_host_still_allowed_in_insecure_public_mode * test_empty_client_host_still_allowed_in_gated_mode All three new fail-closed tests fail without this patch (the helpers return True / None for an empty peer) and pass with it. The 45 pre-existing tests in test_dashboard_auth_ws_auth.py continue to pass.	2026-06-21 13:33:18 -07:00
teknium1	6902eb3913	fix(cli): make ZIP-update directory replace atomic so it can't delete ui-tui Root cause of #49145: the Windows ZIP-update path did rmtree(dst) then copytree(src, dst). If the copy failed partway — common on that path, which only runs because file I/O is already flaky on the machine — the directory was left deleted with nothing copied back. ui-tui/ vanishing is what broke 'hermes --tui' (WinError 267), but the bug hit every top-level directory. _atomic_replace_dir stages the new copy into a sibling temp dir and only swaps it in on full success, restoring the original on failure. A failed update now leaves the live tree untouched instead of half-deleted.	2026-06-21 13:10:22 -07:00
teknium1	db097fb088	fix(cli): auto-restore a deleted ui-tui workspace from git before TUI launch The Windows update path can leave tracked ui-tui/ files deleted in the working tree (HEAD intact). The guard now self-heals: when ui-tui/ is missing in a git checkout, run `git restore -- ui-tui` and continue, falling back to the printed manual-recovery steps only when git can't recover it (no checkout / restore failed). Builds on konsisumer's missing-workspace guard.	2026-06-21 13:10:22 -07:00
konsisumer	537ad9ea9a	fix(cli): guard missing ui-tui workspace before TUI launch	2026-06-21 13:10:22 -07:00
Teknium	d164ed0326	fix(kanban): make reclaim claim-lock-aware to stop task/run status desync (#50366 ) After a worker crash + reclaim + respawn, the board could show a task in the Ready lane while its task_run was 'running' and the new worker was actively executing (#36910). The dispatcher could then treat live work as available and double-assign. Root cause: the three reclaim paths (detect_crashed_workers, release_stale_claims heartbeat-stale backstop, enforce_max_runtime) each snapshot a task's worker_pid/claim_lock, do liveness work, then reset tasks.status back to 'ready' with only a 'WHERE status=running' guard. If the task was reclaimed AND re-claimed by a NEW worker in between (new run, new claim_lock, live pid), the stale UPDATE clobbered the live task: status flipped to 'ready' while the fresh run stayed 'running'. claim_task is the only writer that sets status='running', so nothing put it back — permanent desync. Fix: gate each reset on the snapshot's claim_lock (and worker_pid where available) so it only fires when the task is still owned by the worker the reclaim was computed for. A stale reclaim now no-ops (rowcount 0) instead of desyncing a re-claimed task. Genuine crashes (lock still matches) reclaim exactly as before. This is the same race class the in-gateway dispatch lock (single-writer ticks) mitigates, closed at the row level so a single dispatcher's fast reclaim->respawn across two ticks is also safe. Closes #36910.	2026-06-21 12:49:07 -07:00
memosr	ae46699905	fix(security): validate snapshot_id and file paths in restore_quick_snapshot to prevent path traversal	2026-06-21 12:44:22 -07:00
Teknium	84ba83b09a	fix(kanban): bound the cross-process init lock so connect() can't hang forever (#50353 ) connect() wrapped its entire body in an unbounded blocking flock(LOCK_EX) on every call (_cross_process_init_lock). A single process stalled inside the critical section — or a stale lock held by a wedged worker — blocked every other connect(), including the long-lived gateway dispatcher's next-tick connect, forever. No timeout, no traceback, no recovery: the board silently stopped being worked until a manual restart (issue #36644). Two fixes: 1. Fast-path skip: once THIS process has initialized a path, the expensive first-open work (header validation, integrity probe, schema + additive migrations) is already cached in _INITIALIZED_PATHS. The steady-state connect has nothing for the cross-process lock to protect, so it now opens the connection (WAL + pragmas) under only the cheap in-process _INIT_LOCK and never touches the file lock. This removes the lock from the dispatcher's hot path entirely — a stalled external 'hermes kanban list' can no longer block ticks. 2. Bounded acquire: even on first-init, _cross_process_init_lock now retries a non-blocking acquire up to a 10s deadline, then logs a WARNING and proceeds WITHOUT the cross-process lock. Safe because the in-process _INIT_LOCK still serializes same-process threads and the init work is idempotent (CREATE TABLE IF NOT EXISTS + additive migrations) — worst case is redundant work, not corruption. A bounded 'proceed anyway' beats an unbounded hang. Windows path switched LK_LOCK -> LK_NBLCK (non-blocking) to match. Closes #36644.	2026-06-21 12:43:41 -07:00
Teknium	9630ec6c19	fix(kanban): pin worker TERMINAL_CWD to the task workspace (#50348 ) _default_spawn launched the worker subprocess with cwd=workspace and set HERMES_KANBAN_WORKSPACE, but never set TERMINAL_CWD — so the worker inherited the dispatching gateway's TERMINAL_CWD. That value takes precedence over the process cwd in two places: - tools/file_tools.py::_resolve_base_dir — a relative write_file path resolved against the gateway user's home instead of the workspace, so artifacts silently landed outside the workspace (#41312). - agent_init's context-file loader — AGENTS.md was discovered relative to the gateway's cwd, so under multi-profile dispatch a worker loaded whichever gateway won the claim race's AGENTS.md, not the task's (#34619). Both are the same root cause. Pinning TERMINAL_CWD to the workspace (where the task's work actually happens) fixes both. Guarded on an existing absolute dir because file_tools rejects relative/sentinel TERMINAL_CWD values — a non-dir workspace leaves the inherited value rather than writing a meaningless one. Closes #34619, closes #41312.	2026-06-21 12:43:37 -07:00
Teknium	e217fd42e2	feat(kanban): add task lifecycle plugin hooks (claimed/completed/blocked) (#50349) Plugins could observe session/tool/approval lifecycle but had no way to observe kanban task transitions. Adds three observer hooks fired by the board's claim/complete/block transitions: - kanban_task_claimed (dispatcher process, before worker spawn) - kanban_task_completed (worker process, carries summary) - kanban_task_blocked (worker process, carries reason) Each fires AFTER the DB write txn commits, so a plugin observes durable state and a slow/hanging callback can never hold the SQLite write lock. All firing is best-effort: a raising hook is logged and swallowed and never breaks a board transition. profile_name is resolved from HERMES_HOME so dispatcher- and worker-side hooks carry the right profile. Requested by @Smithangshu on Discord.	2026-06-21 12:38:14 -07:00
Teknium	9d883ac90e	feat(plugins): add ctx.profile_name for session-agnostic profile access (#50346) Plugins previously had no way to read the active profile name from the PluginContext. The workaround in the wild — reaching into ctx._manager._cli_ref — only works in an interactive CLI session; _cli_ref is None in the gateway and in kanban-spawned worker sessions (hermes -p <profile> chat -q ...), so the workaround breaks exactly where multi-profile awareness matters most. ctx.profile_name wraps hermes_cli.profiles.get_active_profile_name(), which derives the name from HERMES_HOME and therefore works in every execution context with zero dependency on _cli_ref.	2026-06-21 12:38:11 -07:00
joaomarcos	475e81dab4	fix(web_server): use run_in_executor for gateway pre-warm and drain-timeout Fixes a regression introduced by the prior approach (synchronous import hermes_cli.gateway inside _lifespan) that caused a new failure mode: the blocking import stalled the asyncio event loop before uvicorn could bind its port, pushing HERMES_DASHBOARD_READY past the desktop shell's 45 s announcement deadline and triggering a respawn loop that accumulated orphaned backend processes. Two-part fix: _lifespan: replace the blocking import with a fire-and-forget run_in_executor call (_warm_gateway_module). The import runs in a worker thread while the server socket is already open, so HERMES_DASHBOARD_READY fires without delay. get_status: replace the inline lazy import with await run_in_executor(None, _resolve_restart_drain_timeout). This is the root fix for the original 15 s socket-timeout: the blocking .pyc-compilation + Defender scan is offloaded to a thread, keeping the event loop free for every /api/status probe. After the first call the module is in sys.modules and the executor returns in microseconds. Both helpers are extracted as module-level sync functions so they can be unit-tested independently of FastAPI or uvicorn. Closes #50209 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-21 12:29:18 -07:00
Teknium	e581740aa1	fix(kanban): single-writer dispatch lock to prevent orphan-dispatcher DB corruption (#50331 ) A shell-launched 'hermes gateway run --replace' / 'gateway restart' on a systemd/launchd host can leave an orphan gateway whose kanban dispatcher escapes the service cgroup, survives 'systemctl restart', and becomes a second long-lived writer on the shared kanban.db. Two dispatchers that each believe they own the file both pass SQLite busy_timeout and then race on WAL frames — the documented root cause of multi-writer corruption (issue #35240). The existing _guard_supervised_gateway_conflict startup guard blocks the common way an orphan is born, but does nothing once a second dispatcher already exists. This adds the defense-in-depth: dispatch_once now wraps every tick in a non-blocking, board-scoped flock (_dispatch_tick_lock). A losing dispatcher returns DispatchResult(skipped_locked=True) and does zero DB writes this tick — so two dispatchers can never run a reclaim/spawn/write sequence concurrently regardless of how the second one got there. - Non-blocking (LOCK_NB): never stalls the gateway's async watcher. - Board-scoped: lock file is a .dispatch.lock sibling of each board's kanban.db, so unrelated boards tick in parallel. - POSIX + Windows (fcntl / msvcrt LK_NBLCK), no-op degrade where neither exists — mirrors the existing _cross_process_init_lock pattern. Verified with a real two-process orphan repro: while a separate process holds the lock, dispatch_once skips; after release it runs.	2026-06-21 12:06:24 -07:00
Teknium	587b5b9ac2	fix(backup): capture memory-provider state stored outside HERMES_HOME (#50325 ) hermes backup only walks HERMES_HOME, so memory providers that keep config/credentials in home-anchored dotdirs (honcho -> ~/.honcho, hindsight -> ~/.hindsight, openviking -> ~/.openviking) lost that data across a backup/import cycle — the peer IDs, session pairings, and API keys never made it into the archive. Add an optional MemoryProvider.backup_paths() hook (default []). The active provider declares its external paths; backup resolves them from config only (no init, no network), archives the ones under the home dir into a reserved _external/ subtree encoded relative to home, and import restores them to their original location with a home-anchored traversal guard and 0600 on credential-shaped files. Paths outside home are skipped as non-portable. honcho, hindsight, and openviking override the hook. E2E-validated full backup->import cycle plus 7 new tests.	2026-06-21 12:03:46 -07:00
kn8-codes	6183e8ce1b	fix(telegram): make Bot API 10.1 rich messages opt-in (default off) Rich messages are not ready for primetime: current Telegram clients can render Bot API 10.1 rich messages as blank/unsupported bubbles and make them hard to copy as plain text, which is worse than the legacy MarkdownV2 path for command snippets and mobile handoffs. Default the rich_messages toggle to False so replies stay on the copyable legacy path; users opt in per bot via platforms.telegram.extra.rich_messages: true. Updates adapter, gateway config default, example config, English + zh-Hans docs, and the default/opt-in tests.	2026-06-21 12:03:24 -07:00
sgaofen	93ea9b04af	fix(gateway): cap inbound media download size to prevent memory exhaustion Inbound image/audio/video payloads were buffered fully into process memory before being written to the cache, with no size limit. A large upload (Discord Nitro allows 500 MB) or a remote media URL in an inbound message pointing at a huge file could spike RAM and OOM-kill the gateway. Enforce a configurable cap in the shared cache helpers (gateway/platforms/ base.py) so the protection holds across every platform adapter, not one: - cache_image/audio/video_from_bytes reject oversized payloads before writing (video was the gap in the original report — now covered). - cache_image/audio_from_url stream the body, rejecting on an oversized Content-Length header and re-checking the running total per chunk so an absent/lying header can't smuggle an unbounded body past the cap. - Discord's _read_attachment_bytes checks att.size up front, so an oversized attachment is rejected before any bytes are pulled into memory. Configurable via gateway.max_inbound_media_bytes in config.yaml (default 128 MiB; 0 disables). No new env var — non-secret config lives in config.yaml. Salvaged and extended from @sgaofen's PR #13341 (the original report and the shared-helper approach). Reapplied onto current main (Discord adapter has since moved to plugins/platforms/discord/), the configurable knob moved from an env var to config.yaml, and the video cache helper added. Co-authored-by: Hermes Agent <noreply@nousresearch.com>	2026-06-21 11:56:46 -07:00
Teknium	a18bae65b9	fix(config): redact api_key in config show/set output (#50245) (#50313) hermes config show printed the model dict raw via print(), bypassing the logging redactor; a custom-provider api_key (e.g. Cloudflare cfut_...) was shown in plaintext even with security.redact_secrets=true. Opaque tokens don't match any vendor-prefix regex, so structural key-name masking is required. - Add redact_config_value(): recursively masks credential-shaped keys (api_key/token/secret/... exact-match) via mask_secret. - Wrap the show_config model dump in it. - Mask the set_config_value echo when the leaf key is credential-shaped (config set model.api_key routes to config.yaml, lowercase misses the .env allowlist).	2026-06-21 11:50:31 -07:00
Teknium	03563dabac	fix(gateway): raise session-hygiene hard message limit 400 → 5000 (#50194) The gateway pre-compression hygiene valve force-compressed any session crossing 400 messages regardless of token usage. On large-context (1M+) models doing many short, message-dense turns, a healthy session at ~16% token usage could hit 400 messages and get force-compressed — and the compression summary's stale Active Task could then bleed into the next turn. The valve's actual purpose is to break a death spiral: when API calls keep disconnecting on an oversized session, no token-usage data arrives, the token threshold never fires, and the transcript grows unbounded. It's a count-based floor for that pathological case only. 400 was tuned for ~200K-context models and is far too low for modern large-context sessions. Raise the default to 5000 — still well clear of any death spiral, but no longer firing on legitimate long conversations. The value remains fully configurable via compression.hygiene_hard_message_limit.	2026-06-21 08:26:19 -07:00
Teknium	e499d69e3e	feat(api-server): configurable concurrent-run cap to prevent DoS (#50007) The OpenAI-compatible API server only enforced a hardcoded cap of 10 concurrent runs on /v1/runs, leaving /v1/chat/completions and /v1/responses unbounded — a request flood could exhaust CPU, memory, and upstream LLM quota (#7483). - Add gateway.api_server.max_concurrent_runs (config.yaml, default 10, 0 disables). No env var. - Shared concurrency gate across all three agent-serving endpoints, counting both the chat/responses in-flight counter and the /v1/runs stream set. Returns OpenAI-style 429 + Retry-After when at the cap. - Remove the dead hardcoded _MAX_CONCURRENT_RUNS class attribute. Closes #7483.	2026-06-21 07:26:03 -07:00
kshitijk4poor	4d7bb382b0	refactor(gateway): route all active_agents coercion through parse_active_agents; harden drain-timeout fallback Second cleanup pass (simplify-code review of the first follow-up): - write_runtime_status now clamps active_agents via parse_active_agents instead of an inline max(0, int(...)). Removes the duplicated clamp the helper's docstring acknowledged AND closes a write-side ValueError gap (a non-numeric active_agents previously raised; now degrades to 0). - hermes_cli/gateway.py draining-status line routes its active-agents count through parse_active_agents too — the third coercion site of the same persisted field, now consistent and non-raising with the two HTTP surfaces. - web_server.py /api/status: the drain-timeout resolver fallback now catches ImportError specifically and falls back to DEFAULT_GATEWAY_RESTART_DRAIN_TIMEOUT (a real float) instead of a blanket 'except Exception -> None'. None would have violated the surfaced field's int/float contract and stripped NAS's poll-deadline hint silently. - Dropped a redundant 'if runtime else 0' branch (parse_active_agents already handles the empty/None case) and tightened the parse_active_agents docstring to describe the actual single-contract role (write + both reads).	2026-06-21 17:22:52 +05:30
kshitijk4poor	b577f25100	refactor(gateway): dedupe drain-timeout resolution + share active_agents parse Follow-up cleanups on top of the busy/idle readout (PR #50103): - web_server.py /api/status reused the single drain-timeout resolver hermes_cli.gateway._get_restart_drain_timeout() (HERMES_RESTART_DRAIN_TIMEOUT env -> agent.restart_drain_timeout config -> default) instead of inlining a third hand-rolled copy of that precedence chain. Also fixes a subtle divergence: the inline copy used os.environ.get() so a set-but-empty env var was treated as a value rather than falling through to config; the shared resolver .strip()s and falls through correctly. - Added gateway.status.parse_active_agents() and routed BOTH HTTP surfaces (/api/status and /health/detailed) through it, so the exposed active_agents field is consistently clamped non-negative. Previously /api/status clamped while /health/detailed exposed the raw file value, diverging on a corrupt count. - Added TestParseActiveAgents covering the shared coercion contract.	2026-06-21 17:22:52 +05:30
Ben	0ee75469d7	feat(dashboard): surface gateway busy/drainable on /api/status Give an external consumer (NAS) a trustworthy, always-reachable busy/idle readout it can poll before a disruptive lifecycle action (restart, migrate, stop, auto-update). The dashboard /api/status is the only HTTP surface guaranteed up on a hosted agent regardless of which gateway platforms are enabled, and it already reads gateway_state.json. Add to /api/status (additive, non-breaking): - active_agents — in-flight gateway-turn count (now refreshed per-turn by the companion gateway-side commit) - gateway_busy — running AND active_agents > 0 - gateway_drainable — running and live (a valid begin-drain target) - restart_drain_timeout — resolved seconds, so the consumer can size its poll deadline without out-of-band knowledge (env HERMES_RESTART_DRAIN_TIMEOUT → config agent.restart_drain_timeout → default) The busy/drainable contract is defined once in gateway.status (derive_gateway_busy / derive_gateway_drainable) and consumed by both /api/status and /health/detailed so the two surfaces can never disagree. Liveness keys off gateway_running (a live PID/health probe), NEVER gateway_updated_at — a healthy idle gateway never advances that timestamp. All derived fields degrade to safe falsy values when the gateway is down or the status file is absent/corrupt (never a spurious "busy" that would wedge the consumer). active_sessions (the 5-min DB recency heuristic the SPA reads) is left exactly as-is — new signal, new fields. Tests (behaviour contracts, not snapshots): the pure derivation contract across every running/state/count/liveness combination; /api/status integration for busy, idle-drainable, draining, down, stale-busy-file, corrupt-count, and timeout surfacing; and /health/detailed parity.	2026-06-21 17:22:52 +05:30
kshitijk4poor	1ca29723f0	fix(cli): log instead of swallow preflight-warning errors; consistent TUI warning field Follow-up to the salvaged preflight-compression warning: - Replace silent `except Exception: pass` at all 5 guard call sites (cli.py x2, gateway/slash_commands.py x2, tui_gateway/server.py) with `logger.debug(...)` so signature drift in the guard helper isn't hidden. - tui_gateway/server.py: set the confirm dict's `warning` field to the merged message (was bare expensive-model text) so it matches `confirm_message` for any future consumer reading `warning`. - Add trailing newlines to the two new files.	2026-06-21 16:31:56 +05:30
Tuna Dev	04730f32e7	fix(cli): warn when in-session model switch will preflight-compress Adds hermes_cli/context_switch_guard.py mirroring the model_cost_guard pattern. When a user switches models mid-session (Herm TUI picker, CLI, or /model on Telegram/Discord), the warning surfaces on the existing ModelSwitchResult.warning_message path used by the expensive-model guard if the new model's compression threshold is below the current session size. Partial fix for #23767 — addresses only the 'user-facing guardrail when switching from a high-context provider to a substantially lower-context provider' slice. The other proposed fixes from that issue (hard preflight token guard, metadata cache invalidation on switch, compression safety invariant, oversized tool-output handling) are out of scope for this PR.	2026-06-21 16:29:31 +05:30
kshitij	f6a504d088	Merge pull request #50025 from NousResearch/salvage/cron-run-immediate fix(cron): execute job immediately on action=run	2026-06-21 13:53:13 +05:30
kyssta-exe	65d7c7fafd	fix(cron): execute job immediately on action='run' `cronjob(action='run')` (and `hermes cron run`) only set `next_run_at = now` and returned success, relying on the scheduler ticker to actually execute the job on its next tick. When no gateway/ticker is running — a CLI-only setup, or the Windows case in #41037 — the job never executed: `run` reported success, but `last_run_at` stayed null forever, no output, no delivery. A manual `run` should actually run. `_execute_job_now` now: - claims the job via `claim_job_for_fire` — the same at-most-once CAS the scheduler/external-provider fire path uses. This both advances `next_run_at` for recurring jobs and blocks a concurrently-running gateway ticker from double-firing the same job; if the claim is lost, the run is skipped (the tool reports `execution_skipped`). This closes the double-fire race that a bare `advance_next_run` left open (a tick whose `get_due_jobs` already captured the job between trigger and advance would still fire it). - delegates firing to `run_one_job` — the single shared execute→save→deliver→mark body the ticker and external providers use — so failure delivery, `[SILENT]` handling, and live-adapter delivery stay identical across paths and can't drift. (The original salvage re-implemented this sequence inline and had already dropped failure delivery + `[SILENT]`.) The tool response carries `executed`, `execution_success`, and either `execution_error` or `execution_skipped`. The `hermes cron run` CLI message no longer claims "It will run on the next scheduler tick" — it reports the actual "Ran now: succeeded/failed" outcome (or the skip). Salvaged from #41130 by @kyssta-exe (authorship preserved); reworked to reuse `claim_job_for_fire` + `run_one_job` per review rather than re-implementing the fire sequence inline. Adds tests for the claim-then-fire path, claim-lost skip, failure reporting, and exception capture. 
Fixes #41037 Co-authored-by: kyssta-exe <kyssta-exe@users.noreply.github.com>	2026-06-21 13:28:04 +05:30
annguyenNous	07424da76f	fix(cron): keep ticker alive on BaseException + heartbeat-aware status The in-process cron ticker (cron/scheduler_provider.py) caught only `Exception` and logged at DEBUG, so a `SystemExit`/`KeyboardInterrupt` raised from a misbehaving provider SDK or agent retry path killed the ticker thread silently. The gateway PROCESS stayed up, so `hermes cron status` — which only checks `find_gateway_pids()` — kept reporting "✓ jobs will fire automatically" while no jobs ever fired (#32612, #32895). This makes ticker death survivable and detectable: - The ticker loop now catches `BaseException` and logs at ERROR with a traceback, so a single bad tick no longer tears the thread down and the failure is visible in the gateway log. - The loop records a heartbeat (`cron/ticker_heartbeat`, epoch seconds) on startup and after every tick — best-effort, never raised into the loop. Both ticker entry points (the gateway and the desktop fallback in web_server.py) funnel through `InProcessCronScheduler.start`, so one heartbeat site covers both. - `hermes cron status` now reads the heartbeat age: if the gateway is running but the heartbeat is stale (> 200s, i.e. several missed ~60s ticks), it reports the ticker as STALLED and suggests a restart instead of falsely claiming jobs will fire. A missing heartbeat (older build / never ran) is treated as "unknown", not "dead". Adds tests for BaseException survival, per-iteration heartbeat recording, heartbeat round-trip/age, staleness detection, and silent-write-failure. Salvaged from #49660 (BaseException survival on current structure), extended with the heartbeat + honest-status reporting that the earlier (pre-refactor) watchdog PRs #35616 and #33849 proposed. Fixes #32612 Fixes #32895 Co-authored-by: banditburai <promptsiren@gmail.com> Co-authored-by: sweetcornna <96944678+sweetcornna@users.noreply.github.com>	2026-06-21 13:00:50 +05:30
Tortugasaur	c02648c5dd	fix(docs): align slash-command and docker docs	2026-06-20 23:23:47 -07:00
Kevin Anderson	b337afdf6e	docs(cli): fix broken terminal-backend guide link in setup wizard The terminal backend onboarding step pointed at /docs/developer-guide/environments, which no longer exists. Point it at the live docs page /docs/user-guide/configuration#terminal-backend-configuration. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-20 23:23:47 -07:00
EloquentBrush0x	9bd5003d4f	fix(spotify): quarantine dead tokens on terminal refresh failure resolve_spotify_runtime_credentials() called _refresh_spotify_oauth_state() without a try/except, so a terminal failure (HTTP 400/401, invalid_grant, refresh_token_reused) raised AuthError but left the dead refresh_token in auth.json. Every subsequent session re-read and retried the same token over the network, failing identically each time. Fix: wrap the refresh call and, when exc.relogin_required is True and a refresh_token is present, clear the dead OAuth fields (access_token, refresh_token, expires_at, expires_in, obtained_at) and write a last_auth_error quarantine marker to auth.json before re-raising. The next call sees no access_token and fails fast with spotify_access_token_missing — no network retry — and the user is prompted to re-authenticate. Mirrors the quarantine pattern already in place for Nous, xAI-OAuth, Codex-OAuth (#28116, #28118), and MiniMax-OAuth (#28119).	2026-06-20 23:23:47 -07:00
teknium1	7ace96ba40	fix(compression): preserve goal, platform, and session indexing across rotation Three state-loss bugs at the compression rotation boundary, fixed together because they all live in the same ~80-line rotation block: - #33618: a persistent /goal did not follow the rotation. load_goal does a flat per-session lookup with no lineage walk, so a goal silently died when compression minted a fresh child id. Added migrate_goal_to_session() and call it after the child session is created (move-not-copy: the parent row is archived as cleared so exactly one active goal row exists). - #33906/#33907: if the child create_session raised (FK constraint, contended write), the outer handler only warned and let the agent continue on the NEW id — which has no row in state.db — producing an orphan session. Now the rotation rolls agent.session_id back to the still-indexed parent (reopening it) instead of stranding the conversation on a phantom id. - #27633: the compaction-boundary on_session_start notification omitted the platform kwarg, so context-engine plugins saw source=unknown for every message after the boundary. Forward platform (matching the initial session-start call in agent_init.py). Co-authored-by: denisqq <21260182+denisqq@users.noreply.github.com> Co-authored-by: zccyman <16263913+zccyman@users.noreply.github.com> Co-authored-by: liuhao1024 <sunsky.lau@gmail.com>	2026-06-20 20:06:24 -07:00
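
Note on e581740aa1 (single-writer dispatch lock): the non-blocking, board-scoped lock that commit describes can be sketched roughly as below. This is a minimal POSIX-only illustration under stated assumptions, not the project's code. The name `try_dispatch_lock` is hypothetical, and the Windows (msvcrt) path and the no-op fallback mentioned in the commit are omitted.

```python
import fcntl
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def try_dispatch_lock(db_path):
    """Yield True if this caller holds the board's lock for the tick, else False.

    Hypothetical helper: the lock file is a `.dispatch.lock` sibling of the
    board's database file, so different boards never contend with each other.
    """
    lock_path = Path(db_path).with_name(".dispatch.lock")
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        try:
            # LOCK_NB makes the call fail immediately instead of waiting,
            # so a losing dispatcher never stalls its event loop.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False  # someone else owns this tick: caller should skip
            return
        try:
            yield True
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A caller would wrap each tick in `with try_dispatch_lock(db_path) as owned:` and perform no database writes when `owned` is False, which is the skip behaviour the commit reports as `skipped_locked=True`.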


2945 commits