hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-06-29 11:42:04 +00:00

Author	SHA1	Message	Date
kshitijk4poor	e0272cfef2	Revert "fix(compression): make minimum context floor configurable (#31600)" This reverts commit `cae1ee44a7`.	2026-06-25 01:04:44 +05:30
kshitij	59acaa972f	Merge pull request #52053 from NousResearch/salvage/31600-minimum-context-length-configurable fix(compression): make minimum context floor configurable (#31600)	2026-06-25 01:02:52 +05:30
Tranquil-Flow	cae1ee44a7	fix(compression): make minimum context floor configurable (#31600 ) Add compression.minimum_context_floor config key that allows users to lower the compression threshold floor below the hardcoded 64K default, preventing infinite tool-call loops on models whose structured output degrades well before 64K tokens. - agent/model_metadata.py: add get_configurable_minimum_context() helper with 16K hard safety limit - agent/context_compressor.py: accept minimum_context_floor param, thread it through _compute_threshold_tokens - agent/conversation_compression.py: use compressor's floor for aux model context validation - agent/agent_init.py: read compression.minimum_context_floor from config and pass to ContextCompressor - gateway/run.py: cache-busting includes new key Salvaged from #31686 by @Tranquil-Flow onto current main. Resolves conflicts with in-place compaction (#38763) and max_tokens threshold computation (#43547) that landed after the original PR. Closes #31600	2026-06-25 00:56:04 +05:30
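The floor resolution described in `cae1ee44a7` can be sketched as follows. This is a minimal illustration, not the code from `agent/model_metadata.py`: the function name, the exact token values chosen for "64K" and "16K", and the handling of malformed values are all assumptions.

```python
# Hypothetical constants: the commit says "64K default" and "16K hard safety
# limit" without giving exact token counts.
DEFAULT_MINIMUM_CONTEXT = 64_000
HARD_SAFETY_LIMIT = 16_000


def resolve_minimum_context_floor(config: dict) -> int:
    """Read compression.minimum_context_floor, never going below the hard limit."""
    raw = config.get("compression", {}).get("minimum_context_floor")
    if raw is None:
        return DEFAULT_MINIMUM_CONTEXT
    try:
        value = int(raw)
    except (TypeError, ValueError):
        # Assumption: a malformed value falls back to the default.
        return DEFAULT_MINIMUM_CONTEXT
    return max(HARD_SAFETY_LIMIT, value)
```

With these assumed values, a configured floor of 32000 is honored and a configured floor of 4000 is raised to 16000.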
liuhao1024	25e2312230	fix(memory): skip drift guard for add (append-only) action (#42874 ) The drift guard (introduced for #26045) correctly protects replace/remove from clobbering un-roundtrippable content, but it also fires on the add path. Since add only appends and never overwrites, the guard is unnecessary and causes false positives when prior add() calls in the same session shift the byte count of the on-disk file. Add skip_drift parameter to _reload_target() and pass True from add(). Replace/remove continue to use the drift guard unchanged. Salvaged from #42880 by @liuhao1024. Closes #42874	2026-06-25 00:51:12 +05:30
Jeffrey Quesnelle	b13e2fd694	Merge pull request #52044 from NousResearch/fix/install-venv-kill-venv-processes fix(install): kill venv-resident gateway before recreating venv on Windows	2026-06-24 15:16:58 -04:00
kshitij	9214aa7dde	Merge pull request #52090 from NousResearch/salvage/35994-reset-deadlock fix(gateway): offload agent cleanup off the event loop in /new reset (#35994)	2026-06-25 00:34:21 +05:30
kshitijk4poor	0225480369	fix(gateway): offload agent cleanup off the event loop in /new reset (#35994 ) The /new (and /reset) confirmation-button callback runs the slash-confirm handler on the asyncio event loop (see _request_slash_confirm). That handler calls _handle_reset_command, which invoked the SYNCHRONOUS, potentially long-blocking _cleanup_agent_resources inline: agent.close() tears down terminal sandboxes, browser daemons and background processes (subprocess waits), and shutdown_memory_provider() can make a network call. A slow teardown wedged the entire event loop, so the bot went silent and stopped processing all messages until a manual restart. Offload _cleanup_agent_resources via the existing contextvar-preserving _run_in_executor_with_context helper, bounded by asyncio.wait_for with a named _RESET_CLEANUP_TIMEOUT_S (30s). The loop is never blocked; on timeout the reset proceeds and the worker thread is left to finish on its own (it cannot be cancelled). The text /new path is unaffected (already off-loop). Tests (tests/gateway/test_35994_reset_button_deadlock.py): the loop keeps ticking while close() blocks in its worker thread; a cleanup that raises is swallowed (warning logged) and the reset still rotates the session; a cleanup that times out degrades gracefully. All three are mutation-verified to fail without their respective production branch.	2026-06-25 00:27:22 +05:30
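The pattern in `0225480369` (run a blocking cleanup on a worker thread and bound the wait) might look roughly like this. It is a sketch under stated assumptions: it uses the plain `loop.run_in_executor` rather than the repository's contextvar-preserving `_run_in_executor_with_context` helper, the function names are hypothetical, and the timeout is shortened from the 30 s named in the commit so the example finishes quickly.

```python
import asyncio
import time

RESET_CLEANUP_TIMEOUT_S = 0.2  # shortened for the example; the commit uses 30 s


def blocking_cleanup(seconds: float) -> str:
    """Stand-in for a synchronous teardown (subprocess waits, network calls)."""
    time.sleep(seconds)
    return "cleaned"


async def reset_session(cleanup_seconds: float) -> str:
    loop = asyncio.get_running_loop()
    # The blocking call runs on a worker thread, so the event loop keeps ticking.
    future = loop.run_in_executor(None, blocking_cleanup, cleanup_seconds)
    try:
        await asyncio.wait_for(future, timeout=RESET_CLEANUP_TIMEOUT_S)
        return "reset after cleanup"
    except asyncio.TimeoutError:
        # The worker thread cannot be cancelled; it finishes on its own while
        # the reset proceeds.
        return "reset without waiting for cleanup"
```

A fast cleanup completes inside the bound; a slow one times out and the reset still proceeds.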
kshitij	de281bcebc	Merge pull request #52084 from NousResearch/salvage/31884-silent-drop-after-stop fix(gateway): surface retry hint instead of silently dropping turn after /stop (#31884)	2026-06-25 00:06:32 +05:30
kshitij	5b065e32ed	Merge pull request #51051 from NousResearch/salvage/cron-provider-pin fix(cron): fail closed when an unpinned job provider drifts from creation snapshot (#44585)	2026-06-25 00:05:52 +05:30
sweetcornna	b41d9b845d	fix(gateway): surface retry hint instead of silently dropping turn after /stop (#31884 ) After /stop, the next user message can hit a stale generation token and return with api_calls=0, no failure, no interruption. _normalize_empty_agent_response fell through to an empty string, so the gateway logged "response=0 chars" and sent nothing — the message was silently lost while internal work sometimes continued. Add the api_calls==0 / not-failed / not-interrupted / not-partial branch to the single normalization chokepoint so the user gets a short retry hint instead of silence. Regression test asserts the hint surfaces. Salvaged from #33851 (re-applied on current main; original was 1401 commits behind and the function had moved).	2026-06-24 23:51:31 +05:30
brooklyn!	35e9c63d89	Merge pull request #52008 from infinitycrew39/fix/desktop-nous-onboarding-stale-provider fix(desktop): stop Nous Portal onboarding from validating stale Anthropic config	2026-06-24 13:12:44 -05:00
emozilla	6638199c53	fix(install): harden venv-resident process sweep on Windows Follow-up to the salvaged venv-recreate fix. Three changes to the Install-Venv pre-delete sweep: - Match the venv path with a case-insensitive StartsWith instead of the PowerShell -like operator. A venv path containing wildcard metacharacters ('[', ']') — legal in a Windows user name — silently fails to match under -like, which would let the locking process slip through and reintroduce the exact access-denied failure this fix closes. - Retry Remove-Item once after a short pause. A force-killed process can take a moment to release its file handles, so the first delete may still hit a locked .pyd; retry before failing the stage. - Note in a comment that the gateway autostart task runs at LIMITED integrity as the current user, so the installer always runs at equal-or-higher integrity and can read the process executable path, and that Get-CimInstance is preferred over Get-Process because it returns a null path for an uninspectable process instead of throwing. Adds a regression test asserting the recreate branch sweeps by venv path prefix, uses StartsWith rather than -like, and runs the sweep before Remove-Item. Covers issues #47036, #47557, #47910.	2026-06-24 13:25:44 -04:00
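The wildcard pitfall behind the `StartsWith` change in `6638199c53` can be reproduced outside PowerShell. The sketch below uses Python's `fnmatch` as an analogue of the `-like` operator (both treat `[...]` as a character class); the path and function names are hypothetical.

```python
from fnmatch import fnmatchcase


def wildcard_prefix_match(path: str, prefix: str) -> bool:
    """Prefix test via wildcard matching: '[' and ']' in the prefix act as a class."""
    return fnmatchcase(path.lower(), prefix.lower() + "*")


def literal_prefix_match(path: str, prefix: str) -> bool:
    """Case-insensitive literal prefix test: no character is special."""
    return path.lower().startswith(prefix.lower())


# Hypothetical venv path under a user name containing brackets.
VENV = r"C:\Users\dev[1]\hermes\venv"
EXE = VENV + r"\Scripts\python.exe"
```

Here `wildcard_prefix_match(EXE, VENV)` is false because `[1]` is read as "the character 1", while `literal_prefix_match(EXE, VENV)` is true.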
infinitycrew39	d8fe1c0b41	test(desktop): cover scoped onboarding runtime readiness checks Assert setup.runtime_check honors provider params and that Nous OAuth onboarding persists model config before validating the connected provider.	2026-06-24 23:19:51 +07:00
kshitij	c42d44cb2f	revert(plugins): restore user dashboard plugin backend API auto-import (#43719) (#51950) * Revert "refactor(security): centralize non-bundled plugin sources in one constant" This reverts commit `e2bea0abe6`. * Revert "fix(security): restrict dashboard plugin backend import to bundled plugins (#43719)" This reverts commit `8845f3316c`.	2026-06-24 07:46:54 -07:00
kshitij	7fb2027d85	Merge pull request #51881 from NousResearch/fix/29559-compression-abort-on-network-failure fix(compression): abort + preserve context on transient network summary failure (#29559, #25585)	2026-06-24 19:54:21 +05:30
Elshayib	1a435a6d5d	fix(model-switch): prevent custom-provider misattribution in model picker (#48305 ) When the current provider is a custom endpoint (custom or custom:), the model switch pipeline must NOT auto-switch to a native provider/OpenRouter based on a static-catalog match. The user explicitly configured their own endpoint and the same model name may be served there; silently rewriting model.provider destroys their config. - detect_static_provider_for_model(): skip the static-catalog scan when the current provider is custom/custom: - switch_model() Step e: extend is_custom to cover custom:* so the detect_provider_for_model() last-resort fallback cannot fire Salvaged from #48351 by Elshayib (authorship preserved). Fixes #48305	2026-06-24 19:34:33 +05:30
kyssta-exe	b85c460540	fix(tui): targeted save_config_value for model persistence (#48305 ) The TUI model-switch persistence (_persist_model_switch) rewrote the entire model config block via save_config(), destroying sibling keys the user set under model: (model_slots, model_fallback, base_url, ...) on every switch. Use targeted, atomic, comment-preserving save_config_value("model.default" / "model.provider" / "model.base_url") writes instead, so a model switch only touches the keys it changes. Salvaged from #48391 by kyssta-exe (authorship preserved). Fixes #48305	2026-06-24 19:34:33 +05:30
kshitij	2187fd884c	Merge pull request #51027 from NousResearch/salvage/typed-model-routing fix(model_switch): route typed configured models off openai-codex (#45006)	2026-06-24 19:32:35 +05:30
kshitijk4poor	1a174dfb50	fix(models): gate openai-codex/xai-oauth soft-accept to family-shaped slugs (#45006 ) Completes the #45006 fix. PR-base commit (configured-provider routing) handles the case where a typed model IS declared in user/custom provider config. This commit closes the other root: when a typed model is NOT in any config and the current provider is a soft-accepting one (openai-codex / xai-oauth), the hidden-model soft-accept (#16172 / #19729) would accept ANY unknown name as a hidden model — so `qwen3.5-4b` typed on a Codex-default session "succeeded" and mislabeled the provider as "OpenAI Codex" (the exact reported symptom), then 400'd on the next turn. Gate the soft-accept to slugs that plausibly belong to the provider's family (openai-codex -> gpt-/codex-/o1/o3/o4; xai-oauth -> grok-). Family-shaped unknown slugs are still soft-accepted (preserving the #16172 entitlement-gated hidden-model intent); unrelated names are rejected with actionable guidance to pin the right provider via `--provider <slug>` or the picker. Adds TestCodexSoftAcceptPlausibilityGate (5 tests): unrelated names rejected on codex/xai, family-shaped hidden slugs still accepted, real catalog models unaffected. Verified load-bearing.	2026-06-24 19:23:53 +05:30
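The plausibility gate from `1a174dfb50` amounts to a per-provider prefix check. A minimal sketch, assuming a simple lookup table: the prefixes come from the commit message, while the names `FAMILY_PREFIXES` and `soft_accept_allowed` are hypothetical.

```python
# Prefixes as listed in the commit message; the table itself is illustrative.
FAMILY_PREFIXES = {
    "openai-codex": ("gpt-", "codex-", "o1", "o3", "o4"),
    "xai-oauth": ("grok-",),
}


def soft_accept_allowed(provider: str, slug: str) -> bool:
    """Allow an unknown model slug only if it looks like the provider's family."""
    prefixes = FAMILY_PREFIXES.get(provider)
    if prefixes is None:
        # Providers without a soft-accept rule never soft-accept here.
        return False
    return slug.lower().startswith(prefixes)
```

Under this sketch, `qwen3.5-4b` is rejected on `openai-codex`, while a family-shaped unknown slug such as `gpt-hidden-preview` is still accepted.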
kshitij	ae20c3fb90	Merge pull request #51025 from NousResearch/salvage/cron-autoreset-override fix(gateway): consume was_auto_reset so /model survives session auto-reset (#48031)	2026-06-24 19:20:11 +05:30
x7peeps	6879d77d74	fix(gateway): consume was_auto_reset so /model survives session auto-reset When `/model X` is the FIRST message after an idle/daily/suspended auto-reset, the slash-command path stores a session model override but leaves `session_entry.was_auto_reset = True` (it never passes through `_handle_message_with_agent`, which is where the flag was consumed). On the NEXT regular message, the auto-reset cleanup block pops the freshly-stored model/reasoning override BEFORE the flag is consumed — so the switch is silently lost and resolution falls back to the config default, while the session DB still shows the switched model (a two-sources-of-truth divergence). Consume the flag at both sites: 1. gateway/run.py — capture `was_auto_reset` into a local and set the attribute False immediately at the top of the cleanup block, so the cleanup can't re-fire on a later message and wipe an override stored between turns. Downstream reads use the captured local. 2. gateway/slash_commands.py — the model path consumes the flag before storing the override, so a /model-first-after-auto-reset isn't wiped by the next message's cleanup. Salvaged from #48062 by x7peeps (authorship preserved). Tests: tests/gateway/test_48031_model_switch_after_auto_reset.py — AST invariants pinning both consume sites (load-bearing; verified they fail when either consume is removed). Mirrors the AST-pin approach in test_35809_auto_reset_clean_context.py. Gateway session/reset suite: 16 passed. Fixes #48031	2026-06-24 19:12:44 +05:30
kshitij	d68a133458	Merge pull request #51890 from NousResearch/salvage/40695-handoff-watcher-async fix(gateway): offload handoff-watcher SQLite calls to avoid blocking the async heartbeat (#40695)	2026-06-24 19:10:52 +05:30
kshitij	7634488074	Merge pull request #51889 from NousResearch/salvage/41289-model-cmd-async fix(gateway): offload Discord /model provider-listing off the event loop (#41289)	2026-06-24 19:06:23 +05:30
kshitijk4poor	ab9134bf16	feat(openviking): add full recall prefetch policy Salvage of PR #48927 by @ehz0ah, which consolidates OpenViking recall work from #41706 (@huangxun375-stack), #33260, #49975, and #32444. Replaces stale background post-turn prefetch warming with synchronous current-query recall. The old queue_prefetch warmed the PREVIOUS user message while turn-start recall consumed the CURRENT one, so injected context was always about the wrong topic. Changes: - prefetch() now does session-aware /api/v1/search/search with the current query, falls back to /api/v1/search/find on failure - Contract-safe payloads: limit, score_threshold, context_type, session_id — no top_k, no search-body mode, no target_uri - L2 content reads for items with level=2 or empty abstracts, capped at full_read_limit (default 2) - Local ranking (score + query-token overlap + leaf boost), dedup, score threshold, and injected-char budget - queue_prefetch() is now a no-op (background warming removed) - Additive batched viking_read: uris param accepts up to 3 URIs - Per-request timeout support on _VikingClient.get/post/delete - Removes stale _prefetch_result/_prefetch_thread/_prefetch_generation state and _invalidate_prefetch_state() - Strengthened system_prompt_block guidance Salvage follow-up fixes: - Expose all 8 recall config knobs in get_config_schema() (PR #48927 had removed them; #41706 correctly exposed them). Env vars remain as internal mechanism but are now visible in setup wizard. - Lower default timeout 8s→4s, request_timeout 6s→3s, full_read_limit 3→2 to reduce per-turn blocking latency. Co-authored-by: Hao Zhe <haozhe4547@gmail.com> Co-authored-by: Eurekaxun <eurekaxun@163.com>	2026-06-24 18:53:49 +05:30
liuhao1024	721cf54fb1	fix(gateway): offload /model provider-listing off the event loop (#41289 ) The Discord/Telegram /model slash command listed providers synchronously on the gateway's async event loop. list_picker_providers / list_authenticated_providers are blocking and can fall through to a synchronous urllib HTTP fetch when the on-disk provider cache is stale, freezing the loop for 120-150s -> "application did not respond" and delayed agent starts. Port #41304's asyncio.to_thread offload to the current handler location. The handler moved from gateway/run.py to gateway/slash_commands.py (_handle_model_command); wrap BOTH blocking call sites so the whole bug class is covered: - picker path -> list_picker_providers - text-fallback path -> list_authenticated_providers asyncio.to_thread is already idiomatic in this module (and asyncio is imported), so the loop now stays responsive while the (possibly network-bound) listing runs on a worker thread. Adds tests/gateway/test_model_command_async_offload.py asserting the offload contract at the real handler seam for both paths (mutation- survivable: reverting either to_thread wrap fails the matching test). Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>	2026-06-24 18:40:52 +05:30
r266-tech	f0c5d812b0	fix(gateway): offload handoff watcher SessionDB polling off the event loop The Discord gateway heartbeat stalled ('Shard ID None heartbeat blocked for more than N seconds') because _handoff_watcher polled the synchronous, blocking SQLite-backed SessionDB directly on the asyncio event loop every 2s. Each list_pending/claim/complete/fail call performed blocking disk I/O on the loop thread, starving the Discord heartbeat coroutine. Wrap every blocking SessionDB call inside the watcher loop in asyncio.to_thread(...) so the SQLite work runs on a worker thread and the event loop (and heartbeat) stays responsive. These four call sites are the only synchronous self._session_db.* calls inside the watcher loop body. Adds tests/gateway/test_handoff_watcher_async_db.py asserting the watcher offloads its SessionDB calls via asyncio.to_thread (mutation-survivable: reverting any to_thread wrap fails the corresponding assertion). Fixes #40695 Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>	2026-06-24 18:40:23 +05:30
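The effect of wrapping a blocking poll in `asyncio.to_thread`, as `f0c5d812b0` does, can be observed with a stand-in heartbeat. This is a sketch only: `list_pending_blocking` simulates a synchronous SessionDB call with `time.sleep`, and every name here is hypothetical.

```python
import asyncio
import time


def list_pending_blocking() -> list[str]:
    """Stand-in for a synchronous, disk-bound SessionDB query."""
    time.sleep(0.2)
    return ["handoff-1"]


async def heartbeat(ticks: list[float], stop: asyncio.Event) -> None:
    """Record a timestamp every 20 ms for as long as the loop is responsive."""
    while not stop.is_set():
        ticks.append(time.monotonic())
        await asyncio.sleep(0.02)


async def poll_once() -> tuple[list[str], int]:
    ticks: list[float] = []
    stop = asyncio.Event()
    hb = asyncio.create_task(heartbeat(ticks, stop))
    # The blocking call runs on a worker thread; the heartbeat keeps ticking.
    pending = await asyncio.to_thread(list_pending_blocking)
    stop.set()
    await hb
    return pending, len(ticks)
```

If the blocking function were called directly inside `poll_once`, the heartbeat task would not get to run until the call returned.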
kshitijk4poor	ac822e4d36	fix(compression): abort (preserve context) on transient network summary failure (#29559 , #25585 ) When context compaction's summary generation fails, the compressor's default path (abort_on_summary_failure=False) drops the middle window and inserts a static 'summary unavailable' marker — destroying the compacted turns. #29559 reported the field impact: a Connection error at the compaction moment dropped 124->15 messages (110 lost) for a long browser-automation task; #25585 is the same failure mode (failed summary commits a destructive compaction anyway). compress() already has an EXCEPTION to the historical drop default: auth failures (401/403) ALWAYS abort and preserve the session, because rotating into a placeholder-summary child on a broken credential strands the user. A transient network/connection error is the same situation in reverse: it WILL recover, and retrying then is strictly better than discarding context for a momentary blip. Extend the always-abort carve-out to terminal connection/network failures: - new _last_summary_network_failure flag, set in _generate_summary's terminal failure branch when _is_connection_error(e) (reached only after any main-model fallback is exhausted), reset alongside the auth flag; - compress() aborts when it's set (returns messages unchanged, _last_compress_aborted=True), independent of abort_on_summary_failure; - a network-specific operator warning (distinct from the auth + config-flag messages). Scoped to connection errors only: a generic 500/400 still takes the historical fallback-drop path (test_non_auth_failure_still_uses_fallback_path stays green). Tests: network-failure detection + abort-despite-flag-false, both mutation-checked (removing the flag-set fails detection; removing the carve-out fails the abort).	2026-06-24 18:31:51 +05:30
xxxigm	89540d592b	test(cli): cover non-interactive prompt_yes_no fallback Regression coverage for the desktop gateway-restart hang: prompt_yes_no returns its default when HERMES_NONINTERACTIVE=1 or on a bare EOFError (closed/redirected stdin), and still exits on KeyboardInterrupt.	2026-06-24 17:56:30 +05:30
Ben	c93b9f9057	feat(relay): terminal 4401 (opt-out) → clean "Relay disabled" state Phase 7 Unit 7d-B. When an operator opts an instance OUT of the Team Gateway relay (Unit 7b deprovision), the connector revokes the per-gateway secret and closes the gateway's WS with 4401. The reconnect supervisor previously treated EVERY close as retryable, so the live process spun "retrying 4401" forever and the dashboard showed a red error; opt-out looked like a failure. Now a 4401 close that arrives AFTER a successful handshake is recognized as a terminal credential revocation: - ws_transport.py: track `_handshake_succeeded` (set when a descriptor is received); on a 4401 close after a prior success, latch `auth_revoked` and do NOT spawn the reconnect supervisor. A 4401 BEFORE any successful handshake stays retryable (cold-start / not-yet-provisioned race, not a revocation). 
New `auth_revoked` property + a websockets-version-safe close-code reader (prefers `.rcvd`/`.sent` Close frames; `.code` is deprecated in websockets 13+). - adapter.py: a revocation monitor turns `transport.auth_revoked` into a clean, NON-retryable `relay_disabled` fatal and notifies the gateway's fatal-error handler (so the adapter is removed and NOT queued for reconnection — the credential is dead until the instance is recreated). Monitor is cancelled on disconnect; only started when the transport exposes `auth_revoked` (prod WS). - run.py: `_handle_adapter_fatal_error` maps the `relay_disabled` code to a `disabled` platform_state (not `fatal`/`retrying`). - web: PlatformsCard renders the `disabled` state with a neutral outline badge, a PowerOff icon, and muted (not destructive-red) text + message. New optional `status.disabled` i18n string ("Disabled"). Also bundles the Phase 7 contract-doc update (this doc is authoritative in hermes-agent): docs/relay-connector-contract.md gains an "Author-first resolution + the account-link (DM) path" section documenting the multi-tenant-guild rule (D-7.2 — route by authenticated author binding, never by guild; unlinked → fail-closed), the `/link <code>` DM flow, and the connector-authoritative opt-out + terminal-4401 behavior this PR implements. Tests: +2 ws_transport (4401-after-handshake terminal / no-reconnect; 4401-before-handshake stays retryable) and +2 adapter (revocation → non-retryable relay_disabled fatal + handler fired; no-revocation → no fatal). 138 relay tests pass (incl. the contract-doc conformance test); ruff clean; web tsc clean. Phase 7 Unit 7d-B (relay-adapter solo lane). Q17 → Option 2; Option 3 (live de-register, no recreate) + the restart-re-provision hole deferred post-alpha.	2026-06-24 18:43:01 +10:00
Teknium	3c75e11571	fix(browser): validate agent-browser is runnable, not just present (#51740 ) After `hermes update`, a globally-installed agent-browser's npm postinstall (fixUnixSymlink) re-points the global symlink (e.g. /opt/homebrew/bin/agent-browser) at our local node_modules binary. The next update wipes node_modules, leaving a dangling symlink that `which` still reports but exec fails on with exit 127 — silently breaking every browser tool (#48521). Root cause is trust-on-presence: shutil.which/Path.exists accept a name that resolves but won't run. Add hermes_constants.agent_browser_runnable() (resolves the path + runs --version) and gate all four resolution sites on it: _find_agent_browser now skips a dead candidate and falls through to the next working one (extended PATH -> local .bin -> npx), self-healing the dangling link. dep_ensure/doctor/nous_subscription validate too; doctor warns on a broken link. Closes #48521.	2026-06-24 00:14:49 -07:00
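The move from trust-on-presence to "does it actually run" in `3c75e11571` can be sketched as below. This is not the repository's `agent_browser_runnable()`; the helper names, the timeout, and the candidate list are assumptions.

```python
import shutil
import subprocess
import sys
from typing import Optional


def is_runnable(name_or_path: str, timeout: float = 10.0) -> bool:
    """Return True only if the command resolves AND `--version` exits with 0."""
    resolved = shutil.which(name_or_path)
    if resolved is None:
        return False
    try:
        result = subprocess.run(
            [resolved, "--version"],
            capture_output=True,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Present on disk but not executable in practice.
        return False
    return result.returncode == 0


def first_runnable(candidates: list[str]) -> Optional[str]:
    """Skip dead candidates and fall through to the next working one."""
    for candidate in candidates:
        if is_runnable(candidate):
            return candidate
    return None
```

For instance, `first_runnable(["no-such-binary", sys.executable])` skips the missing name and returns the Python interpreter path.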
Chaz Dinkle	abc3662bf6	fix(gateway): detect launchd in /restart service-manager probe (#43475 ) On a launchd-managed gateway (macOS), /restart stopped the gateway but never relaunched it: the handler's service detection checks only INVOCATION_ID (systemd) and container markers, so under launchd it takes the detached path and exits 0 — which KeepAlive.SuccessfulExit=false treats as a deliberate stop. The gateway stays silently dead until a manual launchctl kickstart. Detect launchd via XPC_SERVICE_NAME, which launchd sets to the job label for processes it spawns. The probe deliberately excludes the literal "0": interactive macOS shells inherit XPC_SERVICE_NAME=0 (a truthy string), and routing an unsupervised interactive gateway to the service path would make it exit non-zero with nothing to revive it. Routing through via_service=True (rather than forcing a non-zero exit on the detached path) matters: the detached path also spawns a helper that relaunches the gateway, so exiting non-zero there would have BOTH the helper and launchd respawn it — two gateways racing for the same bot tokens. The service path spawns no helper; launchd is the single respawner. Fixes #43475. Supersedes the run.py-era probes in #19940/#33393 (the handler has since moved to gateway/slash_commands.py) and avoids the double-spawn risk in the exit-code-site approaches (#43498, #43596).	2026-06-24 00:14:25 -07:00
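The probe described in `abc3662bf6` reduces to a small environment check. A minimal sketch, assuming the environment is passed in as a mapping: the function name and the example job label are hypothetical, and the container-marker checks mentioned in the commit are omitted.

```python
from typing import Mapping, Optional


def detect_service_manager(env: Mapping[str, str]) -> Optional[str]:
    """Guess which supervisor (if any) launched this process."""
    if env.get("INVOCATION_ID"):
        # systemd sets INVOCATION_ID for the units it starts.
        return "systemd"
    label = env.get("XPC_SERVICE_NAME", "")
    # Interactive macOS shells inherit the literal "0", which is truthy as a
    # string but does not indicate a launchd-managed job.
    if label and label != "0":
        return "launchd"
    return None
```

Called as `detect_service_manager(os.environ)`, this returns `None` in an interactive macOS shell and `"launchd"` when a real job label is present.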
Tranquil-Flow	73a20a6ad6	fix(telegram): clip mid-stream overflow instead of splitting (#48648)	2026-06-24 00:00:46 -07:00
teknium1	ba50787180	test(anthropic-oauth): cover login token-endpoint host + fallback Add two regression tests for the salvaged #48706 fix: - login token exchange targets platform.claude.com first - falls back to console.anthropic.com when the new host is unreachable Also map the salvaged contributor's noreply email in release.py AUTHOR_MAP (CI author-map gate).	2026-06-23 23:59:40 -07:00
Teknium	be78fbd70e	Revert "fix(profiles): clone auth.json so OAuth credentials carry to cloned profiles (#51719)" (#51732) This reverts commit `f504aecffe`.	2026-06-23 23:58:43 -07:00
justemu	4aa793345e	fix(matrix): use member_count as DM signal for named DM rooms Most Matrix clients auto-set a room name when creating a DM (e.g. "Alice & Bot" from participant display names), so the old `is_direct and not has_explicit_name` heuristic classified virtually all client-created DM rooms as "room", forcing require_mention gating in legitimate one-on-one DMs. member_count is now the primary DM signal: <=2 members means the room is necessarily a 1:1 conversation, regardless of m.direct or an explicit name. A room that grew to 3+ members but is still in stale m.direct is still classified as a room (conflict flag set). Falls back to the m.direct + name heuristic when the count is unavailable. Also hardens _get_room_member_count with a joined_members API fallback when the cache-backed state_store is empty. Salvaged from #48554 by @justemu onto the current plugin adapter path (gateway/platforms/matrix.py -> plugins/platforms/matrix/adapter.py). Fixes #48551	2026-06-23 23:57:38 -07:00
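The classification order in `4aa793345e` (member count first, `m.direct` plus name heuristic as a fallback) could be expressed like this. It is a sketch, not the adapter's code: the function name, the return shape, and the exact meaning of the conflict flag are assumptions.

```python
from typing import Optional, Tuple


def classify_room(
    member_count: Optional[int],
    is_direct: bool,
    has_explicit_name: bool,
) -> Tuple[str, bool]:
    """Return (kind, conflict) where kind is "dm" or "room"."""
    if member_count is not None:
        if member_count <= 2:
            # Two or fewer members is a 1:1 conversation regardless of
            # m.direct or an explicit room name.
            return "dm", False
        # 3+ members is a room; flag a conflict if m.direct is stale.
        return "room", is_direct
    # Count unavailable: fall back to the older heuristic.
    if is_direct and not has_explicit_name:
        return "dm", False
    return "room", False
```

A client-created DM named "Alice & Bot" with two members now classifies as `("dm", False)` instead of being treated as a room.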
Teknium	0ef86febe2	docs(sessions): clarify sessions.json is the gateway routing index, not the session list (#51726 ) Users who inspect ~/.hermes/sessions/sessions.json see only gateway entries (e.g. agent:main:whatsapp:dm:...) and mistake it for the session index that hermes sessions list / /sessions read — which is actually state.db. Issue #49361 reported CLI sessions as 'invisible' on this premise. - gateway/session.py: write a self-documenting _README sentinel at the top of sessions.json explaining it's the gateway routing index and that ALL sessions (CLI/TUI/gateway) live in state.db; skip _-prefixed keys on load so the sentinel never round-trips into a SessionEntry. - Harden every sessions.json reader against the sentinel: mcp_serve loader, gateway/mirror.py, gateway/channel_directory.py all skip _-prefixed keys. - docs/user-guide/sessions.md: warning callout naming the exact symptom. - tests: assert prune ignores metadata sentinels; add round-trip coverage.	2026-06-23 23:56:36 -07:00
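The reader-side hardening in `0ef86febe2` (skip `_`-prefixed keys so the `_README` sentinel never becomes a session entry) can be sketched as a filter on load. The function name and the sample keys and values are hypothetical.

```python
import json


def load_routing_entries(raw: str) -> dict:
    """Parse sessions.json content, dropping metadata keys that start with '_'."""
    data = json.loads(raw)
    return {key: value for key, value in data.items() if not key.startswith("_")}


# Hypothetical file content: one sentinel plus one routing entry.
SAMPLE = json.dumps(
    {
        "_README": "Gateway routing index; all sessions live in state.db.",
        "agent:main:example": {"session_id": "abc"},
    }
)
```

Here `load_routing_entries(SAMPLE)` returns only the `agent:main:example` entry.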
liuhao1024	7ff48a6291	fix(discord): check pairing store for component button auth Component button interactions (approve/deny, slash confirm, model picker, clarify) were not checking the pairing store for authorization. Users approved via `hermes pairing approve` could send messages and use slash commands (which go through the gateway authz_mixin), but button clicks were rejected because `_component_check_auth` only checked env-var allowlists (DISCORD_ALLOWED_USERS, GATEWAY_ALLOW_ALL_USERS, etc.) and not the pairing store. This was a regression from commit `f6f363662` which intentionally made component auth fail-closed when no allowlist is set (security fix for GHSA-mc26-p6fw-7pp6), but did not account for pairing-based auth. Fix: add a `PairingStore.is_approved("discord", uid)` check to `_component_check_auth`, mirroring `authz_mixin._check_authorization`. The pairing store check runs after all allowlist checks, preserving the fail-closed behavior for non-paired, non-allowed users. Fixes #50627	2026-06-23 23:55:18 -07:00
Teknium	0957d77187	test(agent): cover interrupt tool-tail alternation close (#48879 ) Regression coverage for the synthetic-assistant close: interrupt after a successful tool must persist an assistant tail (placeholder when no delivered text), real delivered text is preserved, and non-interrupted or non-tool tails are left untouched.	2026-06-23 23:52:28 -07:00
teknium1	53f8386587	test(delegation): regression for bedrock Claude target_model api_mode routing Asserts resolve_runtime_provider honors target_model over the stale persisted model.default when choosing the Bedrock dual-path api_mode: Claude target -> anthropic_messages, Nova target -> bedrock_converse. Both fail without the #49095 fix.	2026-06-23 23:49:37 -07:00
teknium1	d4be583d98	fix(telegram): raise default command-menu cap to 60 so skills stay visible The 30-slot default could not fit Hermes's ~50 built-in commands, so every skill command (and 20 built-ins) was silently dropped from the Telegram `/` menu by default; they only worked when typed manually. Raising the default to 60 keeps all built-ins plus common skill commands visible out of the box while staying under Telegram's ~4KB payload limit. Users can still tune it via platforms.telegram.extra.command_menu.	2026-06-23 23:49:22 -07:00
Thestral	dbe14ce35d	feat(gateway): configure Telegram command menu priority Adds a configurable Telegram BotCommand menu cap and priority list via platforms.telegram.extra.command_menu (max_commands clamped 1..100; priority_mode prepend|append|replace). Default cap stays 30; hidden commands remain invokable when typed and /commands lists the full set. Salvaged from PR #42021. Cherry-picked onto current main; the original edited gateway/platforms/telegram.py, now relocated to plugins/platforms/telegram/adapter.py.	2026-06-23 23:49:22 -07:00
Teknium	f504aecffe	fix(profiles): clone auth.json so OAuth credentials carry to cloned profiles (#51719 ) Selective --clone / --clone-from / --clone-config copied .env but not auth.json, silently dropping the credential pool — including OAuth tokens (Anthropic `claude /login`, Codex, xAI) that never land in .env. A profile cloned from an OAuth-authenticated default therefore resolved a different provider (or none) than the source under provider: auto. --clone-all already carried auth.json via the full copytree; only the selective path missed it. Add auth.json to _CLONE_CONFIG_FILES and tighten it to 0o600 after copy, matching .env semantics.	2026-06-23 23:44:34 -07:00
Teknium	050bd01b7b	fix(dashboard): serve uvicorn on SelectorEventLoop on Windows (#50641 ) (#51717 ) On Windows, start_server() served uvicorn via a bare asyncio.run(_serve()), which uses the default ProactorEventLoop. uvicorn's socket-serving stack assumes a SelectorEventLoop on win32 (uvicorn/loops/asyncio.py forces it, and uvicorn.Server.run threads config.get_loop_factory() into its runner for exactly this reason). Driving uvicorn on the proactor loop makes server.startup() bind a socket that never accepts: the dashboard and desktop backend print "Skipping web UI build" then hang forever with the port LISTENING but no TCP handshake completing. Fix is win32-scoped to keep the blast radius minimal: POSIX keeps the exact asyncio.run(_serve()) it had (its default loop is already a SelectorEventLoop / uvloop, which is what uvicorn serves on). Only on Windows do we mirror uvicorn.Server.run and run on the loop factory uvicorn picks, with a fallback to WindowsSelectorEventLoopPolicy for uvicorn < 0.36. Fixes hermes dashboard and hermes desktop (the Electron app spawns a hermes dashboard backend). The gateway symptom in the report has a separate root cause (no uvicorn) and is not addressed here.	2026-06-23 23:43:24 -07:00
teknium1	901165b5a4	fix(cron): complete plugins.cron_providers rename in 2 missed test files uperLu's #50958 renamed plugins/cron → plugins/cron_providers but left two test files patching the now-gone plugins.cron.chronos.verify path, which would fail collection. Point them at plugins.cron_providers.*. Add uperLu to release.py AUTHOR_MAP.	2026-06-23 23:39:22 -07:00
uperLu	0d4cecb352	fix(cron): avoid provider package shadowing core cron	2026-06-23 23:39:22 -07:00
Ben	31bced1607	fix(profiles): detect a separate-process gateway in profile status The dashboard Profiles view showed "Gateway stopped" for a gateway that is in fact running — while the sidebar status strip and `hermes gateway status` (CLI) both correctly showed it running. Reported on v0.17.0 running the gateway + dashboard in one Docker container. Root cause: three liveness surfaces with three detection strengths, all reading the same `gateway.pid`: - `hermes gateway status` -> find_gateway_pids() (process-table scan) - sidebar /api/status -> get_running_pid() + gateway_state.json PID fallback + health-URL probe - Profiles view -> _check_gateway_running() = get_running_pid() ONLY, no fallback `get_running_pid()` short-circuits to None the moment the runtime lock (`gateway.lock`) doesn't register as held by the calling process — which is always true when the reader is a separate process from the gateway (the dashboard is its own s6 service in the container), and also for any launch-service-managed gateway that left a fresh `gateway_state.json` but no live PID file. So the Profiles view alone reported the live gateway as stopped. Fix: give _check_gateway_running the same fallback the sidebar already has — after the pid-file/lock check misses, validate the PID recorded in that profile's gateway_state.json against the live process table via the existing get_runtime_status_running_pid(). read_runtime_status() gains an optional path arg so a profile's state file can be read without mutating the process-global HERMES_HOME (preserving the contextvar-based profile isolation the dashboard relies on). Backward compatible: every existing caller passes no argument. Tests: a regression test that fails pre-fix (live gateway, lock check returns None -> must still report running) and a guard test that a 'stopped' state file is never reported running even with a live PID.	2026-06-24 16:36:17 +10:00
teknium1	366c2a3766	fix(gateway): propagate fatal-config exit code through start_gateway clean-exit path The contributor PR stamped runner._exit_code=78 on non-retryable startup errors, but start_gateway()'s clean-exit branch returned True before the SystemExit(runner.exit_code) site, so main() exited 0. The s6 finish script's [ "$1" = "78" ] check never matched and s6 crash-looped the gateway anyway — the fix was dead as shipped (#51228). Honor runner.exit_code in the clean-exit branch: raise SystemExit(code) when set, else return True (normal /restart clean exit). Add a start_gateway()-level test that asserts process-level SystemExit(78) propagation — the gap the PR's object-level test missed — plus exit_code on the existing _CleanExitRunner mocks.	2026-06-24 16:34:51 +10:00
Francesco Mucio	776f68e1ee	fix(gateway): exit 78 (EX_CONFIG) on fatal startup errors, s6 finish script stops restart loop Profiles without their own messaging token inherit the default profile's token via os.getenv, hit a token collision, and exit with startup_failed. s6 restarts them immediately, creating ~30MB tirith sandbox dirs in /tmp each cycle — filling the disk in hours (#51228). Changes: - gateway/restart.py: add GATEWAY_FATAL_CONFIG_EXIT_CODE = 78 - gateway/run.py: set exit_code=78 on non-retryable startup errors (token collision, no platforms) - hermes_cli/service_manager.py: add _render_finish_script() that translates exit 78 → exit 125 (s6 permanent failure) - hermes_cli/container_boot.py: write finish script alongside run script during profile registration The s6 finish script pattern follows docker/s6-rc.d/dashboard/finish. Closes #51228	2026-06-24 16:34:51 +10:00
Teknium	d93d0aee83	fix(cron): anchor naive schedule timestamps to configured timezone (#51695) A naive ISO timestamp (e.g. 2026-06-22T20:07:00) was anchored to the server's local timezone via dt.astimezone(), but the due-check (get_due_jobs -> _hermes_now()) runs in the CONFIGURED Hermes timezone. When the two diverge (cloud host on UTC with a different timezone: set, or vice-versa) the stored instant lands hours off the user's wall-clock intent, so one-shots never become due and recurring jobs fire at the wrong time. The ticker stays healthy (heartbeat + success markers fresh) because every tick finds nothing due, matching the silent no-fire in #51021. Anchor naive timestamps to _hermes_now().tzinfo so '20:07' means 20:07 on the same clock the scheduler checks against. The legacy _ensure_aware path still treats already-stored naive values as server-local for back-compat. Fixes #51021	2026-06-23 23:29:57 -07:00
Teknium	78e122ae1a	feat(cron): warn when gateway not running on cron create/list (#51696) The cron ticker only runs inside the gateway (_start_cron_ticker); there is no standalone cron daemon. When the gateway isn't running, next_run_at passes but jobs never fire and last_run_at stays null — and manual 'hermes cron run' (which bypasses the ticker) appears to work, masking the real cause. This is the most common cron support report (#51038). cron list already warned; extend the same warning to cron create (the moment the user is most likely to hit this) via a shared helper, and add a pointer to 'hermes cron status'. Silent when a gateway is running, so the gateway /cron path is unaffected.	2026-06-23 23:29:50 -07:00
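The timezone fix in d93d0aee83 can be illustrated with a minimal sketch. This is not the project's code: `CONFIGURED_TZ` and `anchor_to_configured` are hypothetical names, and a fixed UTC-7 offset stands in for whatever timezone Hermes is configured with. It only shows the idea of attaching the configured timezone to a naive timestamp so the wall-clock fields are kept, rather than interpreting it in the server's local zone.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical stand-in for the configured Hermes timezone
# (a fixed UTC-7 offset, chosen only for illustration).
CONFIGURED_TZ = timezone(timedelta(hours=-7))


def anchor_to_configured(ts: str, configured_tz=CONFIGURED_TZ) -> datetime:
    """Parse an ISO timestamp; if it is naive, attach the configured timezone."""
    dt = datetime.fromisoformat(ts)
    if dt.tzinfo is None:
        # replace() keeps the wall-clock fields and only attaches tzinfo,
        # so "20:07" stays 20:07 on the scheduler's clock.
        dt = dt.replace(tzinfo=configured_tz)
    return dt


scheduled = anchor_to_configured("2026-06-22T20:07:00")
print(scheduled.isoformat())                           # 2026-06-22T20:07:00-07:00
print(scheduled.astimezone(timezone.utc).isoformat())  # 2026-06-23T03:07:00+00:00
```

Timestamps that already carry an offset pass through unchanged, so only naive input is affected by the anchoring.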

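The exit-code propagation described in 366c2a3766 and 776f68e1ee can be sketched as follows. The names `StubRunner`, `finish_clean_exit`, and `exit_status` are hypothetical and exist only for this illustration; the value 78 is the one named in the commit messages. The sketch assumes the clean-exit branch raises `SystemExit` when an exit code has been stamped on the runner, and returns True otherwise.

```python
FATAL_CONFIG_EXIT_CODE = 78  # EX_CONFIG, the value named in the commit message


class StubRunner:
    """Hypothetical stand-in for the gateway runner; only exit_code matters here."""

    def __init__(self, exit_code=None):
        self.exit_code = exit_code


def finish_clean_exit(runner) -> bool:
    """Clean-exit branch: surface a stamped exit code instead of returning True."""
    if runner.exit_code:
        raise SystemExit(runner.exit_code)
    return True  # normal clean exit, e.g. after a restart request


def exit_status(runner) -> int:
    """Report the status a process would exit with for this runner."""
    try:
        finish_clean_exit(runner)
    except SystemExit as exc:
        return exc.code
    return 0


print(exit_status(StubRunner()))                        # 0
print(exit_status(StubRunner(FATAL_CONFIG_EXIT_CODE)))  # 78
```

Checking the status at the process level, rather than only the attribute on the runner object, is the gap the second commit says its added test closes.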

6126 commits