hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-06-27 11:22:03 +00:00

Author	SHA1	Message	Date
kshitij	c42d44cb2f	revert(plugins): restore user dashboard plugin backend API auto-import (#43719) (#51950) * Revert "refactor(security): centralize non-bundled plugin sources in one constant" This reverts commit `e2bea0abe6`. * Revert "fix(security): restrict dashboard plugin backend import to bundled plugins (#43719)" This reverts commit `8845f3316c`.	2026-06-24 07:46:54 -07:00
kshitij	7fb2027d85	Merge pull request #51881 from NousResearch/fix/29559-compression-abort-on-network-failure fix(compression): abort + preserve context on transient network summary failure (#29559, #25585)	2026-06-24 19:54:21 +05:30
Elshayib	1a435a6d5d	fix(model-switch): prevent custom-provider misattribution in model picker (#48305) When the current provider is a custom endpoint (custom or custom:), the model switch pipeline must NOT auto-switch to a native provider/OpenRouter based on a static-catalog match. The user explicitly configured their own endpoint and the same model name may be served there; silently rewriting model.provider destroys their config. - detect_static_provider_for_model(): skip the static-catalog scan when the current provider is custom/custom: - switch_model() Step e: extend is_custom to cover custom:* so the detect_provider_for_model() last-resort fallback cannot fire Salvaged from #48351 by Elshayib (authorship preserved). Fixes #48305	2026-06-24 19:34:33 +05:30
kyssta-exe	b85c460540	fix(tui): targeted save_config_value for model persistence (#48305) The TUI model-switch persistence (_persist_model_switch) rewrote the entire model config block via save_config(), destroying sibling keys the user set under model: (model_slots, model_fallback, base_url, ...) on every switch. Use targeted, atomic, comment-preserving save_config_value("model.default" / "model.provider" / "model.base_url") writes instead, so a model switch only touches the keys it changes. Salvaged from #48391 by kyssta-exe (authorship preserved). Fixes #48305	2026-06-24 19:34:33 +05:30
kshitij	2187fd884c	Merge pull request #51027 from NousResearch/salvage/typed-model-routing fix(model_switch): route typed configured models off openai-codex (#45006)	2026-06-24 19:32:35 +05:30
kshitijk4poor	1a174dfb50	fix(models): gate openai-codex/xai-oauth soft-accept to family-shaped slugs (#45006 ) Completes the #45006 fix. PR-base commit (configured-provider routing) handles the case where a typed model IS declared in user/custom provider config. This commit closes the other root: when a typed model is NOT in any config and the current provider is a soft-accepting one (openai-codex / xai-oauth), the hidden-model soft-accept (#16172 / #19729) would accept ANY unknown name as a hidden model — so `qwen3.5-4b` typed on a Codex-default session "succeeded" and mislabeled the provider as "OpenAI Codex" (the exact reported symptom), then 400'd on the next turn. Gate the soft-accept to slugs that plausibly belong to the provider's family (openai-codex -> gpt-/codex-/o1/o3/o4; xai-oauth -> grok-). Family-shaped unknown slugs are still soft-accepted (preserving the #16172 entitlement-gated hidden-model intent); unrelated names are rejected with actionable guidance to pin the right provider via `--provider <slug>` or the picker. Adds TestCodexSoftAcceptPlausibilityGate (5 tests): unrelated names rejected on codex/xai, family-shaped hidden slugs still accepted, real catalog models unaffected. Verified load-bearing.	2026-06-24 19:23:53 +05:30
kshitij	ae20c3fb90	Merge pull request #51025 from NousResearch/salvage/cron-autoreset-override fix(gateway): consume was_auto_reset so /model survives session auto-reset (#48031)	2026-06-24 19:20:11 +05:30
x7peeps	6879d77d74	fix(gateway): consume was_auto_reset so /model survives session auto-reset When `/model X` is the FIRST message after an idle/daily/suspended auto-reset, the slash-command path stores a session model override but leaves `session_entry.was_auto_reset = True` (it never passes through `_handle_message_with_agent`, which is where the flag was consumed). On the NEXT regular message, the auto-reset cleanup block pops the freshly-stored model/reasoning override BEFORE the flag is consumed — so the switch is silently lost and resolution falls back to the config default, while the session DB still shows the switched model (a two-sources-of-truth divergence). Consume the flag at both sites: 1. gateway/run.py — capture `was_auto_reset` into a local and set the attribute False immediately at the top of the cleanup block, so the cleanup can't re-fire on a later message and wipe an override stored between turns. Downstream reads use the captured local. 2. gateway/slash_commands.py — the model path consumes the flag before storing the override, so a /model-first-after-auto-reset isn't wiped by the next message's cleanup. Salvaged from #48062 by x7peeps (authorship preserved). Tests: tests/gateway/test_48031_model_switch_after_auto_reset.py — AST invariants pinning both consume sites (load-bearing; verified they fail when either consume is removed). Mirrors the AST-pin approach in test_35809_auto_reset_clean_context.py. Gateway session/reset suite: 16 passed. Fixes #48031	2026-06-24 19:12:44 +05:30
kshitij	d68a133458	Merge pull request #51890 from NousResearch/salvage/40695-handoff-watcher-async fix(gateway): offload handoff-watcher SQLite calls to avoid blocking the async heartbeat (#40695)	2026-06-24 19:10:52 +05:30
kshitij	7634488074	Merge pull request #51889 from NousResearch/salvage/41289-model-cmd-async fix(gateway): offload Discord /model provider-listing off the event loop (#41289)	2026-06-24 19:06:23 +05:30
kshitijk4poor	ab9134bf16	feat(openviking): add full recall prefetch policy Salvage of PR #48927 by @ehz0ah, which consolidates OpenViking recall work from #41706 (@huangxun375-stack), #33260, #49975, and #32444. Replaces stale background post-turn prefetch warming with synchronous current-query recall. The old queue_prefetch warmed the PREVIOUS user message while turn-start recall consumed the CURRENT one, so injected context was always about the wrong topic. Changes: - prefetch() now does session-aware /api/v1/search/search with the current query, falls back to /api/v1/search/find on failure - Contract-safe payloads: limit, score_threshold, context_type, session_id — no top_k, no search-body mode, no target_uri - L2 content reads for items with level=2 or empty abstracts, capped at full_read_limit (default 2) - Local ranking (score + query-token overlap + leaf boost), dedup, score threshold, and injected-char budget - queue_prefetch() is now a no-op (background warming removed) - Additive batched viking_read: uris param accepts up to 3 URIs - Per-request timeout support on _VikingClient.get/post/delete - Removes stale _prefetch_result/_prefetch_thread/_prefetch_generation state and _invalidate_prefetch_state() - Strengthened system_prompt_block guidance Salvage follow-up fixes: - Expose all 8 recall config knobs in get_config_schema() (PR #48927 had removed them; #41706 correctly exposed them). Env vars remain as internal mechanism but are now visible in setup wizard. - Lower default timeout 8s→4s, request_timeout 6s→3s, full_read_limit 3→2 to reduce per-turn blocking latency. Co-authored-by: Hao Zhe <haozhe4547@gmail.com> Co-authored-by: Eurekaxun <eurekaxun@163.com>	2026-06-24 18:53:49 +05:30
liuhao1024	721cf54fb1	fix(gateway): offload /model provider-listing off the event loop (#41289) The Discord/Telegram /model slash command listed providers synchronously on the gateway's async event loop. list_picker_providers / list_authenticated_providers are blocking and can fall through to a synchronous urllib HTTP fetch when the on-disk provider cache is stale, freezing the loop for 120-150s -> "application did not respond" and delayed agent starts. Port #41304's asyncio.to_thread offload to the current handler location. The handler moved from gateway/run.py to gateway/slash_commands.py (_handle_model_command); wrap BOTH blocking call sites so the whole bug class is covered: - picker path -> list_picker_providers - text-fallback path -> list_authenticated_providers asyncio.to_thread is already idiomatic in this module (and asyncio is imported), so the loop now stays responsive while the (possibly network-bound) listing runs on a worker thread. Adds tests/gateway/test_model_command_async_offload.py asserting the offload contract at the real handler seam for both paths (mutation-survivable: reverting either to_thread wrap fails the matching test). Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>	2026-06-24 18:40:52 +05:30
r266-tech	f0c5d812b0	fix(gateway): offload handoff watcher SessionDB polling off the event loop The Discord gateway heartbeat stalled ('Shard ID None heartbeat blocked for more than N seconds') because _handoff_watcher polled the synchronous, blocking SQLite-backed SessionDB directly on the asyncio event loop every 2s. Each list_pending/claim/complete/fail call performed blocking disk I/O on the loop thread, starving the Discord heartbeat coroutine. Wrap every blocking SessionDB call inside the watcher loop in asyncio.to_thread(...) so the SQLite work runs on a worker thread and the event loop (and heartbeat) stays responsive. These four call sites are the only synchronous self._session_db.* calls inside the watcher loop body. Adds tests/gateway/test_handoff_watcher_async_db.py asserting the watcher offloads its SessionDB calls via asyncio.to_thread (mutation-survivable: reverting any to_thread wrap fails the corresponding assertion). Fixes #40695 Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>	2026-06-24 18:40:23 +05:30
kshitijk4poor	ac822e4d36	fix(compression): abort (preserve context) on transient network summary failure (#29559 , #25585 ) When context compaction's summary generation fails, the compressor's default path (abort_on_summary_failure=False) drops the middle window and inserts a static 'summary unavailable' marker — destroying the compacted turns. #29559 reported the field impact: a Connection error at the compaction moment dropped 124->15 messages (110 lost) for a long browser-automation task; #25585 is the same failure mode (failed summary commits a destructive compaction anyway). compress() already has an EXCEPTION to the historical drop default: auth failures (401/403) ALWAYS abort and preserve the session, because rotating into a placeholder-summary child on a broken credential strands the user. A transient network/connection error is the same situation in reverse: it WILL recover, and retrying then is strictly better than discarding context for a momentary blip. Extend the always-abort carve-out to terminal connection/network failures: - new _last_summary_network_failure flag, set in _generate_summary's terminal failure branch when _is_connection_error(e) (reached only after any main-model fallback is exhausted), reset alongside the auth flag; - compress() aborts when it's set (returns messages unchanged, _last_compress_aborted=True), independent of abort_on_summary_failure; - a network-specific operator warning (distinct from the auth + config-flag messages). Scoped to connection errors only: a generic 500/400 still takes the historical fallback-drop path (test_non_auth_failure_still_uses_fallback_path stays green). Tests: network-failure detection + abort-despite-flag-false, both mutation-checked (removing the flag-set fails detection; removing the carve-out fails the abort).	2026-06-24 18:31:51 +05:30
xxxigm	89540d592b	test(cli): cover non-interactive prompt_yes_no fallback Regression coverage for the desktop gateway-restart hang: prompt_yes_no returns its default when HERMES_NONINTERACTIVE=1 or on a bare EOFError (closed/redirected stdin), and still exits on KeyboardInterrupt.	2026-06-24 17:56:30 +05:30
Ben	c93b9f9057	feat(relay): terminal 4401 (opt-out) → clean "Relay disabled" state Phase 7 Unit 7d-B. When an operator opts an instance OUT of the Team Gateway relay (Unit 7b deprovision), the connector revokes the per-gateway secret and closes the gateway's WS with 4401. The reconnect supervisor previously treated EVERY close as retryable, so the live process spun "retrying 4401" forever and the dashboard showed a red error — opt-out looked like a failure. Now a 4401 close that arrives AFTER a successful handshake is recognized as a terminal credential revocation: - ws_transport.py: track `_handshake_succeeded` (set when a descriptor is received); on a 4401 close after a prior success, latch `auth_revoked` and do NOT spawn the reconnect supervisor. A 4401 BEFORE any successful handshake stays retryable (cold-start / not-yet-provisioned race, not a revocation). New `auth_revoked` property + a websockets-version-safe close-code reader (prefers `.rcvd`/`.sent` Close frames; `.code` is deprecated in websockets 13+). - adapter.py: a revocation monitor turns `transport.auth_revoked` into a clean, NON-retryable `relay_disabled` fatal and notifies the gateway's fatal-error handler (so the adapter is removed and NOT queued for reconnection — the credential is dead until the instance is recreated). Monitor is cancelled on disconnect; only started when the transport exposes `auth_revoked` (prod WS). - run.py: `_handle_adapter_fatal_error` maps the `relay_disabled` code to a `disabled` platform_state (not `fatal`/`retrying`). - web: PlatformsCard renders the `disabled` state with a neutral outline badge, a PowerOff icon, and muted (not destructive-red) text + message. New optional `status.disabled` i18n string ("Disabled"). Also bundles the Phase 7 contract-doc update (this doc is authoritative in hermes-agent): docs/relay-connector-contract.md gains an "Author-first resolution + the account-link (DM) path" section documenting the multi-tenant-guild rule (D-7.2 — route by authenticated author binding, never by guild; unlinked → fail-closed), the `/link <code>` DM flow, and the connector-authoritative opt-out + terminal-4401 behavior this PR implements. Tests: +2 ws_transport (4401-after-handshake terminal / no-reconnect; 4401-before-handshake stays retryable) and +2 adapter (revocation → non-retryable relay_disabled fatal + handler fired; no-revocation → no fatal). 138 relay tests pass (incl. the contract-doc conformance test); ruff clean; web tsc clean. Phase 7 Unit 7d-B (relay-adapter solo lane). Q17 → Option 2; Option 3 (live de-register, no recreate) + the restart-re-provision hole deferred post-alpha.	2026-06-24 18:43:01 +10:00
Teknium	3c75e11571	fix(browser): validate agent-browser is runnable, not just present (#51740 ) After `hermes update`, a globally-installed agent-browser's npm postinstall (fixUnixSymlink) re-points the global symlink (e.g. /opt/homebrew/bin/agent-browser) at our local node_modules binary. The next update wipes node_modules, leaving a dangling symlink that `which` still reports but exec fails on with exit 127 — silently breaking every browser tool (#48521). Root cause is trust-on-presence: shutil.which/Path.exists accept a name that resolves but won't run. Add hermes_constants.agent_browser_runnable() (resolves the path + runs --version) and gate all four resolution sites on it: _find_agent_browser now skips a dead candidate and falls through to the next working one (extended PATH -> local .bin -> npx), self-healing the dangling link. dep_ensure/doctor/nous_subscription validate too; doctor warns on a broken link. Closes #48521.	2026-06-24 00:14:49 -07:00
Chaz Dinkle	abc3662bf6	fix(gateway): detect launchd in /restart service-manager probe (#43475 ) On a launchd-managed gateway (macOS), /restart stopped the gateway but never relaunched it: the handler's service detection checks only INVOCATION_ID (systemd) and container markers, so under launchd it takes the detached path and exits 0 — which KeepAlive.SuccessfulExit=false treats as a deliberate stop. The gateway stays silently dead until a manual launchctl kickstart. Detect launchd via XPC_SERVICE_NAME, which launchd sets to the job label for processes it spawns. The probe deliberately excludes the literal "0": interactive macOS shells inherit XPC_SERVICE_NAME=0 (a truthy string), and routing an unsupervised interactive gateway to the service path would make it exit non-zero with nothing to revive it. Routing through via_service=True (rather than forcing a non-zero exit on the detached path) matters: the detached path also spawns a helper that relaunches the gateway, so exiting non-zero there would have BOTH the helper and launchd respawn it — two gateways racing for the same bot tokens. The service path spawns no helper; launchd is the single respawner. Fixes #43475. Supersedes the run.py-era probes in #19940/#33393 (the handler has since moved to gateway/slash_commands.py) and avoids the double-spawn risk in the exit-code-site approaches (#43498, #43596).	2026-06-24 00:14:25 -07:00
Tranquil-Flow	73a20a6ad6	fix(telegram): clip mid-stream overflow instead of splitting (#48648)	2026-06-24 00:00:46 -07:00
teknium1	ba50787180	test(anthropic-oauth): cover login token-endpoint host + fallback Add two regression tests for the salvaged #48706 fix: - login token exchange targets platform.claude.com first - falls back to console.anthropic.com when the new host is unreachable Also map the salvaged contributor's noreply email in release.py AUTHOR_MAP (CI author-map gate).	2026-06-23 23:59:40 -07:00
Teknium	be78fbd70e	Revert "fix(profiles): clone auth.json so OAuth credentials carry to cloned profiles (#51719)" (#51732) This reverts commit `f504aecffe`.	2026-06-23 23:58:43 -07:00
justemu	4aa793345e	fix(matrix): use member_count as DM signal for named DM rooms Most Matrix clients auto-set a room name when creating a DM (e.g. "Alice & Bot" from participant display names), so the old `is_direct and not has_explicit_name` heuristic classified virtually all client-created DM rooms as "room", forcing require_mention gating in legitimate one-on-one DMs. member_count is now the primary DM signal: <=2 members means the room is necessarily a 1:1 conversation, regardless of m.direct or an explicit name. A room that grew to 3+ members but is still in stale m.direct is still classified as a room (conflict flag set). Falls back to the m.direct + name heuristic when the count is unavailable. Also hardens _get_room_member_count with a joined_members API fallback when the cache-backed state_store is empty. Salvaged from #48554 by @justemu onto the current plugin adapter path (gateway/platforms/matrix.py -> plugins/platforms/matrix/adapter.py). Fixes #48551	2026-06-23 23:57:38 -07:00
Teknium	0ef86febe2	docs(sessions): clarify sessions.json is the gateway routing index, not the session list (#51726 ) Users who inspect ~/.hermes/sessions/sessions.json see only gateway entries (e.g. agent:main:whatsapp:dm:...) and mistake it for the session index that hermes sessions list / /sessions read — which is actually state.db. Issue #49361 reported CLI sessions as 'invisible' on this premise. - gateway/session.py: write a self-documenting _README sentinel at the top of sessions.json explaining it's the gateway routing index and that ALL sessions (CLI/TUI/gateway) live in state.db; skip _-prefixed keys on load so the sentinel never round-trips into a SessionEntry. - Harden every sessions.json reader against the sentinel: mcp_serve loader, gateway/mirror.py, gateway/channel_directory.py all skip _-prefixed keys. - docs/user-guide/sessions.md: warning callout naming the exact symptom. - tests: assert prune ignores metadata sentinels; add round-trip coverage.	2026-06-23 23:56:36 -07:00
liuhao1024	7ff48a6291	fix(discord): check pairing store for component button auth Component button interactions (approve/deny, slash confirm, model picker, clarify) were not checking the pairing store for authorization. Users approved via `hermes pairing approve` could send messages and use slash commands (which go through the gateway authz_mixin), but button clicks were rejected because `_component_check_auth` only checked env-var allowlists (DISCORD_ALLOWED_USERS, GATEWAY_ALLOW_ALL_USERS, etc.) and not the pairing store. This was a regression from commit `f6f363662` which intentionally made component auth fail-closed when no allowlist is set (security fix for GHSA-mc26-p6fw-7pp6), but did not account for pairing-based auth. Fix: add a `PairingStore.is_approved("discord", uid)` check to `_component_check_auth`, mirroring `authz_mixin._check_authorization`. The pairing store check runs after all allowlist checks, preserving the fail-closed behavior for non-paired, non-allowed users. Fixes #50627	2026-06-23 23:55:18 -07:00
Teknium	0957d77187	test(agent): cover interrupt tool-tail alternation close (#48879) Regression coverage for the synthetic-assistant close: interrupt after a successful tool must persist an assistant tail (placeholder when no delivered text), real delivered text is preserved, and non-interrupted or non-tool tails are left untouched.	2026-06-23 23:52:28 -07:00
teknium1	53f8386587	test(delegation): regression for bedrock Claude target_model api_mode routing Asserts resolve_runtime_provider honors target_model over the stale persisted model.default when choosing the Bedrock dual-path api_mode: Claude target -> anthropic_messages, Nova target -> bedrock_converse. Both fail without the #49095 fix.	2026-06-23 23:49:37 -07:00
teknium1	d4be583d98	fix(telegram): raise default command-menu cap to 60 so skills stay visible The 30-slot default could not fit Hermes's ~50 built-in commands, so every skill command (and 20 built-ins) were silently dropped from the Telegram `/` menu by default — they only worked when typed manually. Raising the default to 60 keeps all built-ins plus common skill commands visible out of the box while staying under Telegram's ~4KB payload limit. Users can still tune it via platforms.telegram.extra.command_menu.	2026-06-23 23:49:22 -07:00
Thestral	dbe14ce35d	feat(gateway): configure Telegram command menu priority Adds a configurable Telegram BotCommand menu cap and priority list via platforms.telegram.extra.command_menu (max_commands clamped 1..100; priority_mode prepend|append|replace). Default cap stays 30; hidden commands remain invokable when typed and /commands lists the full set. Salvaged from PR #42021. Cherry-picked onto current main; the original edited gateway/platforms/telegram.py, now relocated to plugins/platforms/telegram/adapter.py.	2026-06-23 23:49:22 -07:00
Teknium	f504aecffe	fix(profiles): clone auth.json so OAuth credentials carry to cloned profiles (#51719) Selective --clone / --clone-from / --clone-config copied .env but not auth.json, silently dropping the credential pool — including OAuth tokens (Anthropic `claude /login`, Codex, xAI) that never land in .env. A profile cloned from an OAuth-authenticated default therefore resolved a different provider (or none) than the source under provider: auto. --clone-all already carried auth.json via the full copytree; only the selective path missed it. Add auth.json to _CLONE_CONFIG_FILES and tighten it to 0o600 after copy, matching .env semantics.	2026-06-23 23:44:34 -07:00
Teknium	050bd01b7b	fix(dashboard): serve uvicorn on SelectorEventLoop on Windows (#50641 ) (#51717 ) On Windows, start_server() served uvicorn via a bare asyncio.run(_serve()), which uses the default ProactorEventLoop. uvicorn's socket-serving stack assumes a SelectorEventLoop on win32 (uvicorn/loops/asyncio.py forces it, and uvicorn.Server.run threads config.get_loop_factory() into its runner for exactly this reason). Driving uvicorn on the proactor loop makes server.startup() bind a socket that never accepts: the dashboard and desktop backend print "Skipping web UI build" then hang forever with the port LISTENING but no TCP handshake completing. Fix is win32-scoped to keep the blast radius minimal: POSIX keeps the exact asyncio.run(_serve()) it had (its default loop is already a SelectorEventLoop / uvloop, which is what uvicorn serves on). Only on Windows do we mirror uvicorn.Server.run and run on the loop factory uvicorn picks, with a fallback to WindowsSelectorEventLoopPolicy for uvicorn < 0.36. Fixes hermes dashboard and hermes desktop (the Electron app spawns a hermes dashboard backend). The gateway symptom in the report has a separate root cause (no uvicorn) and is not addressed here.	2026-06-23 23:43:24 -07:00
teknium1	901165b5a4	fix(cron): complete plugins.cron_providers rename in 2 missed test files uperLu's #50958 renamed plugins/cron → plugins/cron_providers but left two test files patching the now-gone plugins.cron.chronos.verify path, which would fail collection. Point them at plugins.cron_providers.*. Add uperLu to release.py AUTHOR_MAP.	2026-06-23 23:39:22 -07:00
uperLu	0d4cecb352	fix(cron): avoid provider package shadowing core cron	2026-06-23 23:39:22 -07:00
Ben	31bced1607	fix(profiles): detect a separate-process gateway in profile status The dashboard Profiles view showed "Gateway stopped" for a gateway that is in fact running — while the sidebar status strip and `hermes gateway status` (CLI) both correctly showed it running. Reported on v0.17.0 running the gateway + dashboard in one Docker container. Root cause: three liveness surfaces with three detection strengths, all reading the same `gateway.pid`: - `hermes gateway status` -> find_gateway_pids() (process-table scan) - sidebar /api/status -> get_running_pid() + gateway_state.json PID fallback + health-URL probe - Profiles view -> _check_gateway_running() = get_running_pid() ONLY, no fallback `get_running_pid()` short-circuits to None the moment the runtime lock (`gateway.lock`) doesn't register as held by the calling process — which is always true when the reader is a separate process from the gateway (the dashboard is its own s6 service in the container), and also for any launch-service-managed gateway that left a fresh `gateway_state.json` but no live PID file. So the Profiles view alone reported the live gateway as stopped. Fix: give _check_gateway_running the same fallback the sidebar already has — after the pid-file/lock check misses, validate the PID recorded in that profile's gateway_state.json against the live process table via the existing get_runtime_status_running_pid(). read_runtime_status() gains an optional path arg so a profile's state file can be read without mutating the process-global HERMES_HOME (preserving the contextvar-based profile isolation the dashboard relies on). Backward compatible: every existing caller passes no argument. Tests: a regression test that fails pre-fix (live gateway, lock check returns None -> must still report running) and a guard test that a 'stopped' state file is never reported running even with a live PID.	2026-06-24 16:36:17 +10:00
teknium1	366c2a3766	fix(gateway): propagate fatal-config exit code through start_gateway clean-exit path The contributor PR stamped runner._exit_code=78 on non-retryable startup errors, but start_gateway()'s clean-exit branch returned True before the SystemExit(runner.exit_code) site, so main() exited 0. The s6 finish script's [ "$1" = "78" ] check never matched and s6 crash-looped the gateway anyway — the fix was dead as shipped (#51228). Honor runner.exit_code in the clean-exit branch: raise SystemExit(code) when set, else return True (normal /restart clean exit). Add a start_gateway()-level test that asserts process-level SystemExit(78) propagation — the gap the PR's object-level test missed — plus exit_code on the existing _CleanExitRunner mocks.	2026-06-24 16:34:51 +10:00
Francesco Mucio	776f68e1ee	fix(gateway): exit 78 (EX_CONFIG) on fatal startup errors, s6 finish script stops restart loop Profiles without their own messaging token inherit the default profile's token via os.getenv, hit a token collision, and exit with startup_failed. s6 restarts them immediately, creating ~30MB tirith sandbox dirs in /tmp each cycle — filling the disk in hours (#51228). Changes: - gateway/restart.py: add GATEWAY_FATAL_CONFIG_EXIT_CODE = 78 - gateway/run.py: set exit_code=78 on non-retryable startup errors (token collision, no platforms) - hermes_cli/service_manager.py: add _render_finish_script() that translates exit 78 → exit 125 (s6 permanent failure) - hermes_cli/container_boot.py: write finish script alongside run script during profile registration The s6 finish script pattern follows docker/s6-rc.d/dashboard/finish. Closes #51228	2026-06-24 16:34:51 +10:00
Teknium	d93d0aee83	fix(cron): anchor naive schedule timestamps to configured timezone (#51695 ) A naive ISO timestamp (e.g. 2026-06-22T20:07:00) was anchored to the server's local timezone via dt.astimezone(), but the due-check (get_due_jobs -> _hermes_now()) runs in the CONFIGURED Hermes timezone. When the two diverge (cloud host on UTC with a different timezone: set, or vice-versa) the stored instant lands hours off the user's wall-clock intent, so one-shots never become due and recurring jobs fire at the wrong time. The ticker stays healthy (heartbeat + success markers fresh) because every tick finds nothing due, matching the silent no-fire in #51021. Anchor naive timestamps to _hermes_now().tzinfo so '20:07' means 20:07 on the same clock the scheduler checks against. The legacy _ensure_aware path still treats already-stored naive values as server-local for back-compat. Fixes #51021	2026-06-23 23:29:57 -07:00
Teknium	78e122ae1a	feat(cron): warn when gateway not running on cron create/list (#51696) The cron ticker only runs inside the gateway (_start_cron_ticker); there is no standalone cron daemon. When the gateway isn't running, next_run_at passes but jobs never fire and last_run_at stays null — and manual 'hermes cron run' (which bypasses the ticker) appears to work, masking the real cause. This is the most common cron support report (#51038). cron list already warned; extend the same warning to cron create (the moment the user is most likely to hit this) via a shared helper, and add a pointer to 'hermes cron status'. Silent when a gateway is running, so the gateway /cron path is unaffected.	2026-06-23 23:29:50 -07:00
Teknium	c39b2b50ee	fix(tui): stop a cwd package named utils/proxy/ui from crashing the gateway child (#51693 ) Launching Hermes from a directory that ships its own top-level package with a Hermes-internal name (utils/, proxy/, ui/) crashed the gateway/TUI child with an ImportError (exit 1, crash loop): from utils import atomic_replace resolved to the user's package. tui_gateway/entry.py already stripped the relative cwd forms ('' / '.'), but the launch dir also reaches sys.path as its own ABSOLUTE path (venv activation or a project that adds itself to PYTHONPATH), which the strip missed and which sat ahead of the Hermes root. Centralize a hardened guard in hermes_bootstrap.harden_import_path(): drop the relative forms AND force the Hermes source root to the front even when an absolute cwd entry is present. Wire it into tui_gateway/entry.py and acp_adapter/entry.py (both spawn into arbitrary cwds); hermes_cli/main.py and gateway/run.py already insert the root at front. gatewayClient.ts now also exports HERMES_PYTHON_SRC_ROOT for defense in depth.	2026-06-23 23:29:45 -07:00
teknium1	3d56807fbd	fix(gateway): actively reap no-systemd gateway orphan before restart Builds on @wgu9's runtime-tracking fix: now that find_gateway_pids() can see a no-supervisor `gateway restart` runtime, have stop_profile_gateway() fall back to an orphan-aware, profile-scoped reap (SIGTERM then SIGKILL) when the pidfile/runtime record is missing or stale. Closes the duplicate-accumulation path in #51325: a follow-up restart now kills the prior orphan instead of stacking another listener on :8644. Gated on not supports_systemd_services() so a transient `gateway restart` argv on supervised hosts is never killed. Also adds the AUTHOR_MAP entry for the salvaged contributor.	2026-06-23 23:29:28 -07:00
jeremy gu	044996e403	fix(gateway): track no-systemd restart runtimes	2026-06-23 23:29:28 -07:00
Teknium	d539cd9004	fix(config): write config.yaml as UTF-8 to stop emoji/personality corruption (#51676) atomic_yaml_write (and two sibling config writers) called yaml.dump without allow_unicode=True. The default personalities shipped in cli.py contain emoji/kaomoji, so PyYAML escaped astral-plane chars as 8-digit \UXXXXXXXX sequences inside multi-line double-quoted strings wrapped with \ line-continuations. Stricter/non-PyYAML parsers, editors, and hand-edits break that structure into unclosed quotes, failing the whole config parse -> silent fallback to defaults -> custom_providers lost. Add allow_unicode=True to the canonical writer plus tui_gateway/server.py and the telegram adapter's atomic config write so config is written as readable UTF-8 with no escape/fold artifacts. Fixes #51356	2026-06-23 23:28:21 -07:00
Teknium	8e7e104521	fix(cron): tell the user TUI/CLI cron jobs are local-only at create time (#51683) deliver=origin (or omitted) from a TUI or classic-CLI session produces a job with origin=null, because those sessions never populate the HERMES_SESSION_PLATFORM/CHAT_ID context vars that _origin_from_env reads. The scheduler then resolves no delivery target and skips delivery: the job runs and saves output to last_output, but nothing reaches the user and they only find out by polling cronjob(action='list') (#51568). This is by design (local sessions have no live-delivery channel), so the fix surfaces it instead of silently dropping the intent: - cronjob create now appends an informational notice to its result when a created job resolves to zero delivery targets and the user did not explicitly ask for deliver='local'. The check uses the scheduler's own _resolve_delivery_targets so it accounts for origin, home channels, 'all', and explicit platform targets, with no false positives. - PLATFORM_HINTS gains a 'tui' entry (the TUI had none) and the 'cli' hint now states that cron jobs from these sessions are local-only and that deliver must target a gateway-connected platform to notify the user. This stops the agent promising a delivery that never happens. No scheduler/delivery behavior change; no new env var; cron isolation invariant untouched.	2026-06-23 23:27:48 -07:00
Teknium	a39283bf09	test(docker): assert boot migration keeps .env byte-identical across reboots Adds the #51579 regression test the issue asked for: run the real docker_config_migrate.py boot path twice (host-reboot scenario under --restart unless-stopped) and assert $HERMES_HOME/.env survives byte-for-byte and the second boot is a no-op (no re-migration, no new backup). Exercises real migrate_config + real file I/O via subprocess.	2026-06-24 15:23:23 +10:00
LeonSGP43	60d3b8cbce	fix(docker): restore config backups after failed boot migration	2026-06-24 15:23:23 +10:00
teknium1	7f1c278db8	fix(photon): intercept console.log so 'stream interrupted' bursts escalate spectrum-ts routes stream telemetry through @photon-ai/otel's createLogger, which sends severity>=ERROR to console.error and WARN/INFO to console.log. The two lines the health monitor keys off land on different channels: log.error("stream persistently failing") -> console.error (caught), but log.warn("stream interrupted; reconnecting") -> console.log (was missed). The original interception patched console.error only, so the recovering -> degraded escalation counter never saw the interrupt bursts that are the primary silent-inbound symptom. Verified live against spectrum-ts 3.1.0 + @photon-ai/otel: 3 real log.warn('stream interrupted') calls now escalate to degraded -> process.exit(75) -> adapter reconnect. Adds a shared classifyStreamLog() fed by both console.error and console.log, plus a regression test asserting both channels are intercepted.	2026-06-23 21:33:10 -07:00
XU SUN	0952acbf4d	fix(photon): label upstream CatchUpEvents failures	2026-06-23 21:33:10 -07:00
helix4u	06cbc3bae9	fix(photon): recover degraded upstream stream	2026-06-23 21:33:10 -07:00
xxxigm	34bd6a0db5	test(installer): lock Python-fallback propagation into the venv stage (#50769) Source-level regression guard (the script only runs on Windows, so there's no runner on Linux CI). Asserts Resolve-AvailablePythonVersion exists, that Install-Venv re-resolves the interpreter before the venv-creation line, and that Test-Python and the resolver share the single $PythonFallbackVersions constant so detection and venv creation can't drift apart again.	2026-06-23 21:33:08 -07:00
pefontana	667a9f5139	fix(update): reuse an existing PATH uv on Termux before pip _ensure_uv_for_termux only checked resolve_uv() (the managed $HERMES_HOME/bin/uv) before falling back to pip, so a uv installed via `pkg install uv` lives on PATH but is invisible to the helper. Combined with the cherry-picked wheel-only fallback, a Termux user with no managed uv still hit `pip install uv`, which has no Android wheel and tried to source-build the Rust crate, OOM-killing low-memory devices. Probe shutil.which("uv") right after the Termux guard and reuse it before pip. Add a regression test that keeps resolve_uv() returning None while a uv exists on PATH and asserts pip is never invoked.	2026-06-23 18:42:05 -07:00
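The probe order described above (managed uv first, then a uv on PATH, and only then pip) might look roughly like this. It is a sketch under assumptions: `pick_uv` and its injectable `which` parameter are hypothetical, and `resolve_managed` merely stands in for the `resolve_uv()` named in the commit.

```python
import shutil
from typing import Callable, Optional


def pick_uv(resolve_managed: Callable[[], Optional[str]],
            which: Callable[[str], Optional[str]] = shutil.which) -> Optional[str]:
    """Return a usable uv path, or None if the caller must fall back to pip."""
    managed = resolve_managed()
    if managed:
        # The managed binary wins when it exists.
        return managed
    # Otherwise reuse a uv already on PATH (e.g. installed by a package manager).
    return which("uv")


# Simulate a host with no managed uv but one available on PATH.
found = pick_uv(lambda: None, which=lambda name: "/usr/bin/" + name)
print(found)  # /usr/bin/uv
```

Injecting `which` keeps the helper testable without touching the real PATH, which mirrors the regression test's setup of "no managed uv, but a uv on PATH".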
jinhyuk9714	3e508363f7	fix(update): avoid source-building uv on Termux	2026-06-23 18:42:05 -07:00

6112 commits