Commit graph

6112 commits

Author SHA1 Message Date
kshitij
c42d44cb2f
revert(plugins): restore user dashboard plugin backend API auto-import (#43719) (#51950)
* Revert "refactor(security): centralize non-bundled plugin sources in one constant"

This reverts commit e2bea0abe6.

* Revert "fix(security): restrict dashboard plugin backend import to bundled plugins (#43719)"

This reverts commit 8845f3316c.
2026-06-24 07:46:54 -07:00
kshitij
7fb2027d85
Merge pull request #51881 from NousResearch/fix/29559-compression-abort-on-network-failure
fix(compression): abort + preserve context on transient network summary failure (#29559, #25585)
2026-06-24 19:54:21 +05:30
Elshayib
1a435a6d5d fix(model-switch): prevent custom-provider misattribution in model picker (#48305)
When the current provider is a custom endpoint (custom or custom:*), the model
switch pipeline must NOT auto-switch to a native provider/OpenRouter based on a
static-catalog match. The user explicitly configured their own endpoint and the
same model name may be served there; silently rewriting model.provider destroys
their config.

- detect_static_provider_for_model(): skip the static-catalog scan when the
  current provider is custom/custom:*
- switch_model() Step e: extend is_custom to cover custom:* so the
  detect_provider_for_model() last-resort fallback cannot fire

Salvaged from #48351 by Elshayib (authorship preserved).

Fixes #48305
2026-06-24 19:34:33 +05:30
kyssta-exe
b85c460540 fix(tui): targeted save_config_value for model persistence (#48305)
The TUI model-switch persistence (_persist_model_switch) rewrote the entire
model config block via save_config(), destroying sibling keys the user set
under model: (model_slots, model_fallback, base_url, ...) on every switch.

Use targeted, atomic, comment-preserving save_config_value("model.default" /
"model.provider" / "model.base_url") writes instead, so a model switch only
touches the keys it changes.

Salvaged from #48391 by kyssta-exe (authorship preserved).

Fixes #48305
2026-06-24 19:34:33 +05:30
kshitij
2187fd884c
Merge pull request #51027 from NousResearch/salvage/typed-model-routing
fix(model_switch): route typed configured models off openai-codex (#45006)
2026-06-24 19:32:35 +05:30
kshitijk4poor
1a174dfb50 fix(models): gate openai-codex/xai-oauth soft-accept to family-shaped slugs (#45006)
Completes the #45006 fix. PR-base commit (configured-provider routing) handles
the case where a typed model IS declared in user/custom provider config. This
commit closes the other root: when a typed model is NOT in any config and the
current provider is a soft-accepting one (openai-codex / xai-oauth), the
hidden-model soft-accept (#16172 / #19729) would accept ANY unknown name as a
hidden model — so `qwen3.5-4b` typed on a Codex-default session "succeeded" and
mislabeled the provider as "OpenAI Codex" (the exact reported symptom), then
400'd on the next turn.

Gate the soft-accept to slugs that plausibly belong to the provider's family
(openai-codex -> gpt-/codex-/o1/o3/o4; xai-oauth -> grok-). Family-shaped
unknown slugs are still soft-accepted (preserving the #16172 entitlement-gated
hidden-model intent); unrelated names are rejected with actionable guidance to
pin the right provider via `--provider <slug>` or the picker.

Adds TestCodexSoftAcceptPlausibilityGate (5 tests): unrelated names rejected on
codex/xai, family-shaped hidden slugs still accepted, real catalog models
unaffected. Verified load-bearing.
2026-06-24 19:23:53 +05:30
kshitij
ae20c3fb90
Merge pull request #51025 from NousResearch/salvage/cron-autoreset-override
fix(gateway): consume was_auto_reset so /model survives session auto-reset (#48031)
2026-06-24 19:20:11 +05:30
x7peeps
6879d77d74 fix(gateway): consume was_auto_reset so /model survives session auto-reset
When `/model X` is the FIRST message after an idle/daily/suspended auto-reset,
the slash-command path stores a session model override but leaves
`session_entry.was_auto_reset = True` (it never passes through
`_handle_message_with_agent`, which is where the flag was consumed). On the
NEXT regular message, the auto-reset cleanup block pops the freshly-stored
model/reasoning override BEFORE the flag is consumed — so the switch is
silently lost and resolution falls back to the config default, while the
session DB still shows the switched model (a two-sources-of-truth divergence).

Consume the flag at both sites:
  1. gateway/run.py — capture `was_auto_reset` into a local and set the
     attribute False immediately at the top of the cleanup block, so the
     cleanup can't re-fire on a later message and wipe an override stored
     between turns. Downstream reads use the captured local.
  2. gateway/slash_commands.py — the model path consumes the flag before
     storing the override, so a /model-first-after-auto-reset isn't wiped by
     the next message's cleanup.

Salvaged from #48062 by x7peeps (authorship preserved).

Tests: tests/gateway/test_48031_model_switch_after_auto_reset.py — AST
invariants pinning both consume sites (load-bearing; verified they fail when
either consume is removed). Mirrors the AST-pin approach in
test_35809_auto_reset_clean_context.py. Gateway session/reset suite: 16 passed.

Fixes #48031
2026-06-24 19:12:44 +05:30
kshitij
d68a133458
Merge pull request #51890 from NousResearch/salvage/40695-handoff-watcher-async
fix(gateway): offload handoff-watcher SQLite calls to avoid blocking the async heartbeat (#40695)
2026-06-24 19:10:52 +05:30
kshitij
7634488074
Merge pull request #51889 from NousResearch/salvage/41289-model-cmd-async
fix(gateway): offload Discord /model provider-listing off the event loop (#41289)
2026-06-24 19:06:23 +05:30
kshitijk4poor
ab9134bf16 feat(openviking): add full recall prefetch policy
Salvage of PR #48927 by @ehz0ah, which consolidates OpenViking recall
work from #41706 (@huangxun375-stack), #33260, #49975, and #32444.

Replaces stale background post-turn prefetch warming with synchronous
current-query recall. The old queue_prefetch warmed the PREVIOUS user
message while turn-start recall consumed the CURRENT one, so injected
context was always about the wrong topic.

Changes:
- prefetch() now does session-aware /api/v1/search/search with the
  current query, falls back to /api/v1/search/find on failure
- Contract-safe payloads: limit, score_threshold, context_type,
  session_id — no top_k, no search-body mode, no target_uri
- L2 content reads for items with level=2 or empty abstracts, capped
  at full_read_limit (default 2)
- Local ranking (score + query-token overlap + leaf boost), dedup,
  score threshold, and injected-char budget
- queue_prefetch() is now a no-op (background warming removed)
- Additive batched viking_read: uris param accepts up to 3 URIs
- Per-request timeout support on _VikingClient.get/post/delete
- Removes stale _prefetch_result/_prefetch_thread/_prefetch_generation
  state and _invalidate_prefetch_state()
- Strengthened system_prompt_block guidance

Salvage follow-up fixes:
- Expose all 8 recall config knobs in get_config_schema() (PR #48927
  had removed them; #41706 correctly exposed them). Env vars remain
  as internal mechanism but are now visible in setup wizard.
- Lower default timeout 8s→4s, request_timeout 6s→3s, full_read_limit
  3→2 to reduce per-turn blocking latency.

Co-authored-by: Hao Zhe <haozhe4547@gmail.com>
Co-authored-by: Eurekaxun <eurekaxun@163.com>
2026-06-24 18:53:49 +05:30
liuhao1024
721cf54fb1 fix(gateway): offload /model provider-listing off the event loop (#41289)
The Discord/Telegram /model slash command listed providers synchronously
on the gateway's async event loop. list_picker_providers /
list_authenticated_providers are blocking and can fall through to a
synchronous urllib HTTP fetch when the on-disk provider cache is stale,
freezing the loop for 120-150s -> "application did not respond" and
delayed agent starts.

Port #41304's asyncio.to_thread offload to the current handler location.
The handler moved from gateway/run.py to gateway/slash_commands.py
(_handle_model_command); wrap BOTH blocking call sites so the whole bug
class is covered:

  - picker path        -> list_picker_providers
  - text-fallback path -> list_authenticated_providers

asyncio.to_thread is already idiomatic in this module (and asyncio is
imported), so the loop now stays responsive while the (possibly
network-bound) listing runs on a worker thread.

Adds tests/gateway/test_model_command_async_offload.py asserting the
offload contract at the real handler seam for both paths (mutation-
survivable: reverting either to_thread wrap fails the matching test).

Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
2026-06-24 18:40:52 +05:30
r266-tech
f0c5d812b0 fix(gateway): offload handoff watcher SessionDB polling off the event loop
The Discord gateway heartbeat stalled ('Shard ID None heartbeat blocked
for more than N seconds') because _handoff_watcher polled the synchronous,
blocking SQLite-backed SessionDB directly on the asyncio event loop every
2s. Each list_pending/claim/complete/fail call performed blocking disk I/O
on the loop thread, starving the Discord heartbeat coroutine.

Wrap every blocking SessionDB call inside the watcher loop in
asyncio.to_thread(...) so the SQLite work runs on a worker thread and the
event loop (and heartbeat) stays responsive. These four call sites are the
only synchronous self._session_db.* calls inside the watcher loop body.

Adds tests/gateway/test_handoff_watcher_async_db.py asserting the watcher
offloads its SessionDB calls via asyncio.to_thread (mutation-survivable:
reverting any to_thread wrap fails the corresponding assertion).

Fixes #40695

Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
2026-06-24 18:40:23 +05:30
kshitijk4poor
ac822e4d36 fix(compression): abort (preserve context) on transient network summary failure (#29559, #25585)
When context compaction's summary generation fails, the compressor's default
path (abort_on_summary_failure=False) drops the middle window and inserts a
static 'summary unavailable' marker — destroying the compacted turns. #29559
reported the field impact: a Connection error at the compaction moment dropped
124->15 messages (110 lost) for a long browser-automation task; #25585 is the
same failure mode (failed summary commits a destructive compaction anyway).

compress() already has an EXCEPTION to the historical drop default: auth
failures (401/403) ALWAYS abort and preserve the session, because rotating into
a placeholder-summary child on a broken credential strands the user. A transient
network/connection error is the same situation in reverse: it WILL recover, and
retrying then is strictly better than discarding context for a momentary blip.

Extend the always-abort carve-out to terminal connection/network failures:
- new _last_summary_network_failure flag, set in _generate_summary's terminal
  failure branch when _is_connection_error(e) (reached only after any main-model
  fallback is exhausted), reset alongside the auth flag;
- compress() aborts when it's set (returns messages unchanged,
  _last_compress_aborted=True), independent of abort_on_summary_failure;
- a network-specific operator warning (distinct from the auth + config-flag
  messages).

Scoped to connection errors only: a generic 500/400 still takes the historical
fallback-drop path (test_non_auth_failure_still_uses_fallback_path stays green).

Tests: network-failure detection + abort-despite-flag-false, both mutation-checked
(removing the flag-set fails detection; removing the carve-out fails the abort).
2026-06-24 18:31:51 +05:30
xxxigm
89540d592b test(cli): cover non-interactive prompt_yes_no fallback
Regression coverage for the desktop gateway-restart hang: prompt_yes_no
returns its default when HERMES_NONINTERACTIVE=1 or on a bare EOFError
(closed/redirected stdin), and still exits on KeyboardInterrupt.
2026-06-24 17:56:30 +05:30
Ben
c93b9f9057 feat(relay): terminal 4401 (opt-out) → clean "Relay disabled" state
Some checks are pending
CI / detect (push) Waiting to run
CI / tests (push) Blocked by required conditions
CI / lint (push) Blocked by required conditions
CI / typecheck (push) Blocked by required conditions
CI / docs-site (push) Blocked by required conditions
CI / history-check (push) Blocked by required conditions
CI / contributor-check (push) Blocked by required conditions
CI / uv-lockfile (push) Blocked by required conditions
CI / docker-lint (push) Blocked by required conditions
CI / supply-chain (push) Blocked by required conditions
CI / osv-scanner (push) Blocked by required conditions
CI / All required checks pass (push) Blocked by required conditions
Deploy Site / deploy-vercel (push) Waiting to run
Deploy Site / deploy-docs (push) Waiting to run
Docker Build and Publish / build-amd64 (push) Waiting to run
Docker Build and Publish / build-arm64 (push) Waiting to run
Docker Build and Publish / merge (push) Blocked by required conditions
Phase 7 Unit 7d-B. When an operator opts an instance OUT of the Team Gateway
relay (Unit 7b deprovision), the connector revokes the per-gateway secret and
closes the gateway's WS with 4401. The reconnect supervisor previously treated
EVERY close as retryable, so the live process spun "retrying 4401" forever and
the dashboard showed a red error — opt-out looked like a failure.

Now a 4401 close that arrives AFTER a successful handshake is recognized as a
terminal credential revocation:

- ws_transport.py: track `_handshake_succeeded` (set when a descriptor is
  received); on a 4401 close after a prior success, latch `auth_revoked` and do
  NOT spawn the reconnect supervisor. A 4401 BEFORE any successful handshake
  stays retryable (cold-start / not-yet-provisioned race, not a revocation).
  New `auth_revoked` property + a websockets-version-safe close-code reader
  (prefers `.rcvd`/`.sent` Close frames; `.code` is deprecated in websockets 13+).
- adapter.py: a revocation monitor turns `transport.auth_revoked` into a clean,
  NON-retryable `relay_disabled` fatal and notifies the gateway's fatal-error
  handler (so the adapter is removed and NOT queued for reconnection — the
  credential is dead until the instance is recreated). Monitor is cancelled on
  disconnect; only started when the transport exposes `auth_revoked` (prod WS).
- run.py: `_handle_adapter_fatal_error` maps the `relay_disabled` code to a
  `disabled` platform_state (not `fatal`/`retrying`).
- web: PlatformsCard renders the `disabled` state with a neutral outline badge,
  a PowerOff icon, and muted (not destructive-red) text + message. New optional
  `status.disabled` i18n string ("Disabled").

Also bundles the Phase 7 contract-doc update (this doc is authoritative in
hermes-agent): docs/relay-connector-contract.md gains an "Author-first
resolution + the account-link (DM) path" section documenting the
multi-tenant-guild rule (D-7.2 — route by authenticated author binding, never by
guild; unlinked → fail-closed), the `/link <code>` DM flow, and the
connector-authoritative opt-out + terminal-4401 behavior this PR implements.

Tests: +2 ws_transport (4401-after-handshake terminal / no-reconnect;
4401-before-handshake stays retryable) and +2 adapter (revocation → non-retryable
relay_disabled fatal + handler fired; no-revocation → no fatal). 138 relay tests
pass (incl. the contract-doc conformance test); ruff clean; web tsc clean.

Phase 7 Unit 7d-B (relay-adapter solo lane). Q17 → Option 2; Option 3 (live
de-register, no recreate) + the restart-re-provision hole deferred post-alpha.
2026-06-24 18:43:01 +10:00
Teknium
3c75e11571
fix(browser): validate agent-browser is runnable, not just present (#51740)
After `hermes update`, a globally-installed agent-browser's npm postinstall
(fixUnixSymlink) re-points the global symlink (e.g. /opt/homebrew/bin/agent-browser)
at our local node_modules binary. The next update wipes node_modules, leaving a
dangling symlink that `which` still reports but exec fails on with exit 127 —
silently breaking every browser tool (#48521).

Root cause is trust-on-presence: shutil.which/Path.exists accept a name that
resolves but won't run. Add hermes_constants.agent_browser_runnable() (resolves
the path + runs --version) and gate all four resolution sites on it:
_find_agent_browser now skips a dead candidate and falls through to the next
working one (extended PATH -> local .bin -> npx), self-healing the dangling link.
dep_ensure/doctor/nous_subscription validate too; doctor warns on a broken link.

Closes #48521.
2026-06-24 00:14:49 -07:00
Chaz Dinkle
abc3662bf6 fix(gateway): detect launchd in /restart service-manager probe (#43475)
On a launchd-managed gateway (macOS), /restart stopped the gateway but
never relaunched it: the handler's service detection checks only
INVOCATION_ID (systemd) and container markers, so under launchd it takes
the detached path and exits 0 — which KeepAlive.SuccessfulExit=false
treats as a deliberate stop. The gateway stays silently dead until a
manual launchctl kickstart.

Detect launchd via XPC_SERVICE_NAME, which launchd sets to the job label
for processes it spawns. The probe deliberately excludes the literal
"0": interactive macOS shells inherit XPC_SERVICE_NAME=0 (a truthy
string), and routing an unsupervised interactive gateway to the service
path would make it exit non-zero with nothing to revive it.

Routing through via_service=True (rather than forcing a non-zero exit
on the detached path) matters: the detached path also spawns a helper
that relaunches the gateway, so exiting non-zero there would have BOTH
the helper and launchd respawn it — two gateways racing for the same
bot tokens. The service path spawns no helper; launchd is the single
respawner.

Fixes #43475. Supersedes the run.py-era probes in #19940/#33393 (the
handler has since moved to gateway/slash_commands.py) and avoids the
double-spawn risk in the exit-code-site approaches (#43498, #43596).
2026-06-24 00:14:25 -07:00
Tranquil-Flow
73a20a6ad6 fix(telegram): clip mid-stream overflow instead of splitting (#48648) 2026-06-24 00:00:46 -07:00
teknium1
ba50787180 test(anthropic-oauth): cover login token-endpoint host + fallback
Add two regression tests for the salvaged #48706 fix:
- login token exchange targets platform.claude.com first
- falls back to console.anthropic.com when the new host is unreachable

Also map the salvaged contributor's noreply email in release.py
AUTHOR_MAP (CI author-map gate).
2026-06-23 23:59:40 -07:00
Teknium
be78fbd70e
Revert "fix(profiles): clone auth.json so OAuth credentials carry to cloned profiles (#51719)" (#51732)
This reverts commit f504aecffe.
2026-06-23 23:58:43 -07:00
justemu
4aa793345e fix(matrix): use member_count as DM signal for named DM rooms
Most Matrix clients auto-set a room name when creating a DM (e.g.
"Alice & Bot" from participant display names), so the old
`is_direct and not has_explicit_name` heuristic classified virtually
all client-created DM rooms as "room", forcing require_mention gating
in legitimate one-on-one DMs.

member_count is now the primary DM signal: <=2 members means the room
is necessarily a 1:1 conversation, regardless of m.direct or an explicit
name. A room that grew to 3+ members but is still in stale m.direct is
still classified as a room (conflict flag set). Falls back to the
m.direct + name heuristic when the count is unavailable.

Also hardens _get_room_member_count with a joined_members API fallback
when the cache-backed state_store is empty.

Salvaged from #48554 by @justemu onto the current plugin adapter path
(gateway/platforms/matrix.py -> plugins/platforms/matrix/adapter.py).

Fixes #48551
2026-06-23 23:57:38 -07:00
Teknium
0ef86febe2
docs(sessions): clarify sessions.json is the gateway routing index, not the session list (#51726)
Users who inspect ~/.hermes/sessions/sessions.json see only gateway entries
(e.g. agent:main:whatsapp:dm:...) and mistake it for the session index that
hermes sessions list / /sessions read — which is actually state.db. Issue
#49361 reported CLI sessions as 'invisible' on this premise.

- gateway/session.py: write a self-documenting _README sentinel at the top of
  sessions.json explaining it's the gateway routing index and that ALL sessions
  (CLI/TUI/gateway) live in state.db; skip _-prefixed keys on load so the
  sentinel never round-trips into a SessionEntry.
- Harden every sessions.json reader against the sentinel: mcp_serve loader,
  gateway/mirror.py, gateway/channel_directory.py all skip _-prefixed keys.
- docs/user-guide/sessions.md: warning callout naming the exact symptom.
- tests: assert prune ignores metadata sentinels; add round-trip coverage.
2026-06-23 23:56:36 -07:00
liuhao1024
7ff48a6291 fix(discord): check pairing store for component button auth
Component button interactions (approve/deny, slash confirm, model
picker, clarify) were not checking the pairing store for authorization.
Users approved via `hermes pairing approve` could send messages and use
slash commands (which go through the gateway authz_mixin), but button
clicks were rejected because `_component_check_auth` only checked
env-var allowlists (DISCORD_ALLOWED_USERS, GATEWAY_ALLOW_ALL_USERS,
etc.) and not the pairing store.

This was a regression from commit f6f363662 which intentionally made
component auth fail-closed when no allowlist is set (security fix for
GHSA-mc26-p6fw-7pp6), but did not account for pairing-based auth.

Fix: add a `PairingStore.is_approved("discord", uid)` check to
`_component_check_auth`, mirroring `authz_mixin._check_authorization`.
The pairing store check runs after all allowlist checks, preserving the
fail-closed behavior for non-paired, non-allowed users.

Fixes #50627
2026-06-23 23:55:18 -07:00
Teknium
0957d77187 test(agent): cover interrupt tool-tail alternation close (#48879)
Regression coverage for the synthetic-assistant close: interrupt after a
successful tool must persist an assistant tail (placeholder when no
delivered text), real delivered text is preserved, and non-interrupted
or non-tool tails are left untouched.
2026-06-23 23:52:28 -07:00
teknium1
53f8386587 test(delegation): regression for bedrock Claude target_model api_mode routing
Asserts resolve_runtime_provider honors target_model over the stale
persisted model.default when choosing the Bedrock dual-path api_mode:
Claude target -> anthropic_messages, Nova target -> bedrock_converse.
Both fail without the #49095 fix.
2026-06-23 23:49:37 -07:00
teknium1
d4be583d98 fix(telegram): raise default command-menu cap to 60 so skills stay visible
The 30-slot default could not fit Hermes's ~50 built-in commands, so
every skill command (and 20 built-ins) were silently dropped from the
Telegram \`/\` menu by default — they only worked when typed manually.
Raising the default to 60 keeps all built-ins plus common skill commands
visible out of the box while staying under Telegram's ~4KB payload limit.
Users can still tune it via platforms.telegram.extra.command_menu.
2026-06-23 23:49:22 -07:00
Thestral
dbe14ce35d feat(gateway): configure Telegram command menu priority
Adds a configurable Telegram BotCommand menu cap and priority list via
platforms.telegram.extra.command_menu (max_commands clamped 1..100;
priority_mode prepend|append|replace). Default cap stays 30; hidden
commands remain invokable when typed and /commands lists the full set.

Salvaged from PR #42021. Cherry-picked onto current main; the original
edited gateway/platforms/telegram.py, now relocated to
plugins/platforms/telegram/adapter.py.
2026-06-23 23:49:22 -07:00
Teknium
f504aecffe
fix(profiles): clone auth.json so OAuth credentials carry to cloned profiles (#51719)
Selective --clone / --clone-from / --clone-config copied .env but not
auth.json, silently dropping the credential pool — including OAuth tokens
(Anthropic `claude /login`, Codex, xAI) that never land in .env. A profile
cloned from an OAuth-authenticated default therefore resolved a different
provider (or none) than the source under provider: auto. --clone-all already
carried auth.json via the full copytree; only the selective path missed it.

Add auth.json to _CLONE_CONFIG_FILES and tighten it to 0o600 after copy,
matching .env semantics.
2026-06-23 23:44:34 -07:00
Teknium
050bd01b7b
fix(dashboard): serve uvicorn on SelectorEventLoop on Windows (#50641) (#51717)
On Windows, start_server() served uvicorn via a bare asyncio.run(_serve()),
which uses the default ProactorEventLoop. uvicorn's socket-serving stack
assumes a SelectorEventLoop on win32 (uvicorn/loops/asyncio.py forces it, and
uvicorn.Server.run threads config.get_loop_factory() into its runner for
exactly this reason). Driving uvicorn on the proactor loop makes
server.startup() bind a socket that never accepts: the dashboard and desktop
backend print "Skipping web UI build" then hang forever with the port
LISTENING but no TCP handshake completing.

Fix is win32-scoped to keep the blast radius minimal: POSIX keeps the exact
asyncio.run(_serve()) it had (its default loop is already a SelectorEventLoop /
uvloop, which is what uvicorn serves on). Only on Windows do we mirror
uvicorn.Server.run and run on the loop factory uvicorn picks, with a fallback
to WindowsSelectorEventLoopPolicy for uvicorn < 0.36.

Fixes hermes dashboard and hermes desktop (the Electron app spawns a
hermes dashboard backend). The gateway symptom in the report has a separate
root cause (no uvicorn) and is not addressed here.
2026-06-23 23:43:24 -07:00
teknium1
901165b5a4 fix(cron): complete plugins.cron_providers rename in 2 missed test files
uperLu's #50958 renamed plugins/cron → plugins/cron_providers but left
two test files patching the now-gone plugins.cron.chronos.verify path,
which would fail collection. Point them at plugins.cron_providers.*.
Add uperLu to release.py AUTHOR_MAP.
2026-06-23 23:39:22 -07:00
uperLu
0d4cecb352 fix(cron): avoid provider package shadowing core cron 2026-06-23 23:39:22 -07:00
Ben
31bced1607 fix(profiles): detect a separate-process gateway in profile status
The dashboard Profiles view showed "Gateway stopped" for a gateway that
is in fact running — while the sidebar status strip and `hermes gateway
status` (CLI) both correctly showed it running. Reported on v0.17.0
running the gateway + dashboard in one Docker container.

Root cause: three liveness surfaces with three detection strengths, all
reading the same `gateway.pid`:

  - `hermes gateway status` -> find_gateway_pids() (process-table scan)
  - sidebar /api/status     -> get_running_pid() + gateway_state.json PID
                               fallback + health-URL probe
  - Profiles view           -> _check_gateway_running() = get_running_pid()
                               ONLY, no fallback

`get_running_pid()` short-circuits to None the moment the runtime lock
(`gateway.lock`) doesn't register as held by the *calling* process —
which is always true when the reader is a separate process from the
gateway (the dashboard is its own s6 service in the container), and also
for any launch-service-managed gateway that left a fresh
`gateway_state.json` but no live PID file. So the Profiles view alone
reported the live gateway as stopped.

Fix: give _check_gateway_running the same fallback the sidebar already
has — after the pid-file/lock check misses, validate the PID recorded in
that profile's gateway_state.json against the live process table via the
existing get_runtime_status_running_pid(). read_runtime_status() gains an
optional path arg so a profile's state file can be read without mutating
the process-global HERMES_HOME (preserving the contextvar-based profile
isolation the dashboard relies on). Backward compatible: every existing
caller passes no argument.

Tests: a regression test that fails pre-fix (live gateway, lock check
returns None -> must still report running) and a guard test that a
'stopped' state file is never reported running even with a live PID.
2026-06-24 16:36:17 +10:00
teknium1
366c2a3766 fix(gateway): propagate fatal-config exit code through start_gateway clean-exit path
The contributor PR stamped runner._exit_code=78 on non-retryable startup
errors, but start_gateway()'s clean-exit branch returned True before the
SystemExit(runner.exit_code) site, so main() exited 0. The s6 finish
script's [ "$1" = "78" ] check never matched and s6 crash-looped the
gateway anyway — the fix was dead as shipped (#51228).

Honor runner.exit_code in the clean-exit branch: raise SystemExit(code)
when set, else return True (normal /restart clean exit). Add a
start_gateway()-level test that asserts process-level SystemExit(78)
propagation — the gap the PR's object-level test missed — plus exit_code
on the existing _CleanExitRunner mocks.
2026-06-24 16:34:51 +10:00
Francesco Mucio
776f68e1ee fix(gateway): exit 78 (EX_CONFIG) on fatal startup errors, s6 finish script stops restart loop
Profiles without their own messaging token inherit the default
profile's token via os.getenv, hit a token collision, and exit with
startup_failed.  s6 restarts them immediately, creating ~30MB tirith
sandbox dirs in /tmp each cycle — filling the disk in hours (#51228).

Changes:
- gateway/restart.py: add GATEWAY_FATAL_CONFIG_EXIT_CODE = 78
- gateway/run.py: set exit_code=78 on non-retryable startup errors
  (token collision, no platforms)
- hermes_cli/service_manager.py: add _render_finish_script() that
  translates exit 78 → exit 125 (s6 permanent failure)
- hermes_cli/container_boot.py: write finish script alongside run
  script during profile registration

The s6 finish script pattern follows docker/s6-rc.d/dashboard/finish.

Closes #51228
2026-06-24 16:34:51 +10:00
Teknium
d93d0aee83
fix(cron): anchor naive schedule timestamps to configured timezone (#51695)
A naive ISO timestamp (e.g. 2026-06-22T20:07:00) was anchored to the
server's local timezone via dt.astimezone(), but the due-check
(get_due_jobs -> _hermes_now()) runs in the CONFIGURED Hermes timezone.
When the two diverge (cloud host on UTC with a different timezone: set,
or vice-versa) the stored instant lands hours off the user's wall-clock
intent, so one-shots never become due and recurring jobs fire at the
wrong time. The ticker stays healthy (heartbeat + success markers fresh)
because every tick finds nothing due, matching the silent no-fire in #51021.

Anchor naive timestamps to _hermes_now().tzinfo so '20:07' means 20:07 on
the same clock the scheduler checks against. The legacy _ensure_aware path
still treats already-stored naive values as server-local for back-compat.

Fixes #51021
2026-06-23 23:29:57 -07:00
Teknium
78e122ae1a
feat(cron): warn when gateway not running on cron create/list (#51696)
The cron ticker only runs inside the gateway (_start_cron_ticker); there
is no standalone cron daemon. When the gateway isn't running, next_run_at
passes but jobs never fire and last_run_at stays null — and manual
'hermes cron run' (which bypasses the ticker) appears to work, masking
the real cause. This is the most common cron support report (#51038).

cron list already warned; extend the same warning to cron create (the
moment the user is most likely to hit this) via a shared helper, and add
a pointer to 'hermes cron status'. Silent when a gateway is running, so
the gateway /cron path is unaffected.
2026-06-23 23:29:50 -07:00
Teknium
c39b2b50ee
fix(tui): stop a cwd package named utils/proxy/ui from crashing the gateway child (#51693)
Launching Hermes from a directory that ships its own top-level package with a
Hermes-internal name (utils/, proxy/, ui/) crashed the gateway/TUI child with
an ImportError (exit 1, crash loop): from utils import atomic_replace resolved
to the user's package.

tui_gateway/entry.py already stripped the relative cwd forms ('' / '.'), but
the launch dir also reaches sys.path as its own ABSOLUTE path (venv activation
or a project that adds itself to PYTHONPATH), which the strip missed and which
sat ahead of the Hermes root.

Centralize a hardened guard in hermes_bootstrap.harden_import_path(): drop the
relative forms AND force the Hermes source root to the front even when an
absolute cwd entry is present. Wire it into tui_gateway/entry.py and
acp_adapter/entry.py (both spawn into arbitrary cwds); hermes_cli/main.py and
gateway/run.py already insert the root at front. gatewayClient.ts now also
exports HERMES_PYTHON_SRC_ROOT for defense in depth.
2026-06-23 23:29:45 -07:00
teknium1
3d56807fbd fix(gateway): actively reap no-systemd gateway orphan before restart
Builds on @wgu9's runtime-tracking fix: now that find_gateway_pids() can
see a no-supervisor `gateway restart` runtime, have stop_profile_gateway()
fall back to an orphan-aware, profile-scoped reap (SIGTERM then SIGKILL)
when the pidfile/runtime record is missing or stale. Closes the duplicate-
accumulation path in #51325 — a follow-up restart now kills the prior
orphan instead of stacking another listener on :8644. Gated on
not supports_systemd_services() so a transient `gateway restart` argv on
supervised hosts is never killed.

Also adds the AUTHOR_MAP entry for the salvaged contributor.
2026-06-23 23:29:28 -07:00
jeremy gu
044996e403 fix(gateway): track no-systemd restart runtimes 2026-06-23 23:29:28 -07:00
Teknium
d539cd9004
fix(config): write config.yaml as UTF-8 to stop emoji/personality corruption (#51676)
atomic_yaml_write (and two sibling config writers) called yaml.dump
without allow_unicode=True. The default personalities shipped in cli.py
contain emoji/kaomoji, so PyYAML escaped astral-plane chars as 8-digit
\\UXXXXXXXX sequences inside multi-line double-quoted strings wrapped
with \\ line-continuations. Stricter/non-PyYAML parsers, editors, and
hand-edits break that structure into unclosed quotes, failing the whole
config parse -> silent fallback to defaults -> custom_providers lost.

Add allow_unicode=True to the canonical writer plus tui_gateway/server.py
and the telegram adapter's atomic config write so config is written as
readable UTF-8 with no escape/fold artifacts.

Fixes #51356
2026-06-23 23:28:21 -07:00
Teknium
8e7e104521
fix(cron): tell the user TUI/CLI cron jobs are local-only at create time (#51683)
deliver=origin (or omitted) from a TUI or classic-CLI session produces a
job with origin=null, because those sessions never populate the
HERMES_SESSION_PLATFORM/CHAT_ID context vars that _origin_from_env reads.
The scheduler then resolves no delivery target and skips delivery — the
job runs and saves output to last_output, but nothing reaches the user
and they only find out by polling cronjob(action='list') (#51568).

This is by design (local sessions have no live-delivery channel), so the
fix surfaces it instead of silently dropping the intent:

- cronjob create now appends an informational notice to its result when
  a created job resolves to zero delivery targets and the user did not
  explicitly ask for deliver='local'. The check uses the scheduler's own
  _resolve_delivery_targets so it accounts for origin, home channels,
  'all', and explicit platform targets — no false positives.
- PLATFORM_HINTS gains a 'tui' entry (the TUI had none) and the 'cli'
  hint now states that cron jobs from these sessions are local-only and
  that deliver must target a gateway-connected platform to notify the
  user. This stops the agent promising a delivery that never happens.

No scheduler/delivery behavior change; no new env var; cron isolation
invariant untouched.
2026-06-23 23:27:48 -07:00
Teknium
a39283bf09 test(docker): assert boot migration keeps .env byte-identical across reboots
Adds the #51579 regression test the issue asked for: run the real
docker_config_migrate.py boot path twice (host-reboot scenario under
--restart unless-stopped) and assert $HERMES_HOME/.env survives
byte-for-byte and the second boot is a no-op (no re-migration, no new
backup). Exercises real migrate_config + real file I/O via subprocess.
2026-06-24 15:23:23 +10:00
LeonSGP43
60d3b8cbce fix(docker): restore config backups after failed boot migration 2026-06-24 15:23:23 +10:00
teknium1
7f1c278db8 fix(photon): intercept console.log so 'stream interrupted' bursts escalate
spectrum-ts routes stream telemetry through @photon-ai/otel's createLogger,
which sends severity>=ERROR to console.error and WARN/INFO to console.log.
The two lines the health monitor keys off land on different channels:
log.error("stream persistently failing") -> console.error (caught), but
log.warn("stream interrupted; reconnecting") -> console.log (was missed).

The original interception patched console.error only, so the recovering->
degraded escalation counter never saw the interrupt bursts that are the
primary silent-inbound symptom. Verified live against spectrum-ts 3.1.0 +
@photon-ai/otel: 3 real log.warn('stream interrupted') calls now escalate
to degraded -> process.exit(75) -> adapter reconnect.

Adds a shared classifyStreamLog() fed by both console.error and console.log,
plus a regression test asserting both channels are intercepted.
2026-06-23 21:33:10 -07:00
XU SUN
0952acbf4d fix(photon): label upstream CatchUpEvents failures 2026-06-23 21:33:10 -07:00
helix4u
06cbc3bae9 fix(photon): recover degraded upstream stream 2026-06-23 21:33:10 -07:00
xxxigm
34bd6a0db5 test(installer): lock Python-fallback propagation into the venv stage (#50769)
Source-level regression guard (the script only runs on Windows, so there's no
runner on Linux CI). Asserts Resolve-AvailablePythonVersion exists, that
Install-Venv re-resolves the interpreter before the venv-creation line, and
that Test-Python and the resolver share the single $PythonFallbackVersions
constant so detection and venv creation can't drift apart again.
2026-06-23 21:33:08 -07:00
pefontana
667a9f5139 fix(update): reuse an existing PATH uv on Termux before pip
_ensure_uv_for_termux only checked resolve_uv() (the managed
$HERMES_HOME/bin/uv) before falling back to pip, so a uv installed via
`pkg install uv` lives on PATH but is invisible to the helper. Combined
with the cherry-picked wheel-only fallback, a Termux user with no managed
uv still hit `pip install uv`, which has no Android wheel and tried to
source-build the Rust crate, OOM-killing low-memory devices.

Probe shutil.which("uv") right after the Termux guard and reuse it before
pip. Add a regression test that keeps resolve_uv() returning None while a
uv exists on PATH and asserts pip is never invoked.
2026-06-23 18:42:05 -07:00
jinhyuk9714
3e508363f7 fix(update): avoid source-building uv on Termux 2026-06-23 18:42:05 -07:00