When the current provider is a custom endpoint (custom or custom:*), the model
switch pipeline must NOT auto-switch to a native provider/OpenRouter based on a
static-catalog match. The user explicitly configured their own endpoint and the
same model name may be served there; silently rewriting model.provider destroys
their config.
- detect_static_provider_for_model(): skip the static-catalog scan when the
current provider is custom/custom:*
- switch_model() Step e: extend is_custom to cover custom:* so the
detect_provider_for_model() last-resort fallback cannot fire
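A minimal sketch of the custom-endpoint check described above. The helper name `is_custom_provider` is hypothetical; only the `custom` / `custom:*` naming comes from this change.

```python
from typing import Optional


def is_custom_provider(provider: Optional[str]) -> bool:
    """Return True for "custom" and for any "custom:<name>" provider slug."""
    if not provider:
        return False
    slug = provider.strip().lower()
    # A bare prefix test ("custom...") would also match unrelated slugs,
    # so require either an exact match or the "custom:" namespace.
    return slug == "custom" or slug.startswith("custom:")
```

Callers would skip the static-catalog scan and the last-resort fallback whenever this returns True.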
Salvaged from #48351 by Elshayib (authorship preserved).
Fixes #48305
The TUI model-switch persistence (_persist_model_switch) rewrote the entire
model config block via save_config(), destroying sibling keys the user set
under model: (model_slots, model_fallback, base_url, ...) on every switch.
Use targeted, atomic, comment-preserving save_config_value("model.default" /
"model.provider" / "model.base_url") writes instead, so a model switch only
touches the keys it changes.
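To illustrate why a targeted write is safer than rewriting the whole block, here is a small in-memory sketch. `set_dotted` is a hypothetical stand-in, not the real `save_config_value`, which additionally writes atomically and preserves comments.

```python
from typing import Any, Dict


def set_dotted(config: Dict[str, Any], dotted_key: str, value: Any) -> None:
    """Set one nested key in place, leaving sibling keys untouched."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        # Create intermediate mappings only when they are missing.
        node = node.setdefault(part, {})
    node[leaf] = value


# Example config with sibling keys the user set by hand (values are made up).
config = {
    "model": {
        "default": "old-model",
        "model_fallback": "backup-model",
        "base_url": "http://localhost:8000/v1",
    }
}
set_dotted(config, "model.default", "new-model")
```

After the call, `model.default` is updated while `model_fallback` and `base_url` are exactly as the user left them.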
Salvaged from #48391 by kyssta-exe (authorship preserved).
Fixes #48305
Completes the #45006 fix. PR-base commit (configured-provider routing) handles
the case where a typed model IS declared in user/custom provider config. This
commit closes the other root: when a typed model is NOT in any config and the
current provider is a soft-accepting one (openai-codex / xai-oauth), the
hidden-model soft-accept (#16172 / #19729) would accept ANY unknown name as a
hidden model — so `qwen3.5-4b` typed on a Codex-default session "succeeded" and
mislabeled the provider as "OpenAI Codex" (the exact reported symptom), then
400'd on the next turn.
Gate the soft-accept to slugs that plausibly belong to the provider's family
(openai-codex -> gpt-/codex-/o1/o3/o4; xai-oauth -> grok-). Family-shaped
unknown slugs are still soft-accepted (preserving the #16172 entitlement-gated
hidden-model intent); unrelated names are rejected with actionable guidance to
pin the right provider via `--provider <slug>` or the picker.
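The gate can be sketched as a prefix table. The prefixes are the ones listed above; the table name, function name, and matching details (case-insensitive prefix test) are assumptions for illustration.

```python
from typing import Dict, Tuple

# Prefixes taken from the commit text; the structure itself is hypothetical.
FAMILY_PREFIXES: Dict[str, Tuple[str, ...]] = {
    "openai-codex": ("gpt-", "codex-", "o1", "o3", "o4"),
    "xai-oauth": ("grok-",),
}


def plausibly_in_family(provider: str, model: str) -> bool:
    """Return True when an unknown slug looks like it belongs to the provider."""
    prefixes = FAMILY_PREFIXES.get(provider)
    if prefixes is None:
        # This sketch only gates the two soft-accepting providers.
        return False
    return model.strip().lower().startswith(prefixes)
```

A family-shaped hidden slug passes the gate and is still soft-accepted, while `qwen3.5-4b` on a Codex session is rejected before it can mislabel the provider.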
Adds TestCodexSoftAcceptPlausibilityGate (5 tests): unrelated names rejected on
codex/xai, family-shaped hidden slugs still accepted, real catalog models
unaffected. Verified load-bearing.
When `/model X` is the FIRST message after an idle/daily/suspended auto-reset,
the slash-command path stores a session model override but leaves
`session_entry.was_auto_reset = True` (it never passes through
`_handle_message_with_agent`, which is where the flag was consumed). On the
NEXT regular message, the auto-reset cleanup block pops the freshly-stored
model/reasoning override BEFORE the flag is consumed — so the switch is
silently lost and resolution falls back to the config default, while the
session DB still shows the switched model (a two-sources-of-truth divergence).
Consume the flag at both sites:
1. gateway/run.py — capture `was_auto_reset` into a local and set the
attribute False immediately at the top of the cleanup block, so the
cleanup can't re-fire on a later message and wipe an override stored
between turns. Downstream reads use the captured local.
2. gateway/slash_commands.py — the model path consumes the flag before
storing the override, so a /model-first-after-auto-reset isn't wiped by
the next message's cleanup.
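The two consume sites can be sketched as follows. `SessionEntry` here is a simplified stand-in and the handler names are hypothetical; only the `was_auto_reset` flag and the capture-then-clear pattern come from this change.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SessionEntry:
    was_auto_reset: bool = False
    overrides: Dict[str, str] = field(default_factory=dict)


def handle_model_command(entry: SessionEntry, model: str) -> None:
    """Slash-command path: consume the flag before storing the override."""
    entry.was_auto_reset = False
    entry.overrides["model"] = model


def handle_message(entry: SessionEntry) -> bool:
    """Message path: capture the flag into a local and clear it immediately."""
    was_auto_reset = entry.was_auto_reset
    entry.was_auto_reset = False
    if was_auto_reset:
        # Cleanup runs at most once per auto-reset.
        entry.overrides.clear()
    return was_auto_reset
```

With both sites consuming the flag, a `/model` sent first after an auto-reset survives the next regular message.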
Salvaged from #48062 by x7peeps (authorship preserved).
Tests: tests/gateway/test_48031_model_switch_after_auto_reset.py — AST
invariants pinning both consume sites (load-bearing; verified they fail when
either consume is removed). Mirrors the AST-pin approach in
test_35809_auto_reset_clean_context.py. Gateway session/reset suite: 16 passed.
Fixes #48031
Salvage of PR #48927 by @ehz0ah, which consolidates OpenViking recall
work from #41706 (@huangxun375-stack), #33260, #49975, and #32444.
Replaces stale background post-turn prefetch warming with synchronous
current-query recall. The old queue_prefetch warmed the PREVIOUS user
message while turn-start recall consumed the CURRENT one, so injected
context was always about the wrong topic.
Changes:
- prefetch() now does session-aware /api/v1/search/search with the
current query, falls back to /api/v1/search/find on failure
- Contract-safe payloads: limit, score_threshold, context_type,
session_id — no top_k, no search-body mode, no target_uri
- L2 content reads for items with level=2 or empty abstracts, capped
at full_read_limit (default 2)
- Local ranking (score + query-token overlap + leaf boost), dedup,
score threshold, and injected-char budget
- queue_prefetch() is now a no-op (background warming removed)
- Additive batched viking_read: uris param accepts up to 3 URIs
- Per-request timeout support on _VikingClient.get/post/delete
- Removes stale _prefetch_result/_prefetch_thread/_prefetch_generation
state and _invalidate_prefetch_state()
- Strengthened system_prompt_block guidance
Salvage follow-up fixes:
- Expose all 8 recall config knobs in get_config_schema() (PR #48927
had removed them; #41706 correctly exposed them). Env vars remain
as internal mechanism but are now visible in setup wizard.
- Lower default timeout 8s→4s, request_timeout 6s→3s, full_read_limit
3→2 to reduce per-turn blocking latency.
Co-authored-by: Hao Zhe <haozhe4547@gmail.com>
Co-authored-by: Eurekaxun <eurekaxun@163.com>
The Discord/Telegram /model slash command listed providers synchronously
on the gateway's async event loop. list_picker_providers /
list_authenticated_providers are blocking and can fall through to a
synchronous urllib HTTP fetch when the on-disk provider cache is stale,
freezing the loop for 120-150s -> "application did not respond" and
delayed agent starts.
Port #41304's asyncio.to_thread offload to the current handler location.
The handler moved from gateway/run.py to gateway/slash_commands.py
(_handle_model_command); wrap BOTH blocking call sites so the whole bug
class is covered:
- picker path -> list_picker_providers
- text-fallback path -> list_authenticated_providers
asyncio.to_thread is already idiomatic in this module (and asyncio is
imported), so the loop now stays responsive while the (possibly
network-bound) listing runs on a worker thread.
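A self-contained sketch of the offload pattern. The blocking function is a stand-in for the real provider listing, and the heartbeat task exists only to show that the loop keeps running while the worker thread is busy.

```python
import asyncio
import time
from typing import List, Tuple


def list_providers_blocking() -> List[str]:
    """Stand-in for a blocking listing call (disk or HTTP bound)."""
    time.sleep(0.2)
    return ["provider-a", "provider-b"]


async def handle_model_command() -> List[str]:
    # Run the blocking call on a worker thread so the event loop stays free.
    return await asyncio.to_thread(list_providers_blocking)


async def demo() -> Tuple[List[str], int]:
    ticks = 0

    async def heartbeat() -> None:
        nonlocal ticks
        while True:
            await asyncio.sleep(0.02)
            ticks += 1

    task = asyncio.create_task(heartbeat())
    providers = await handle_model_command()
    task.cancel()
    return providers, ticks
```

If the listing were called directly inside the coroutine, the heartbeat counter would stay at zero for the whole duration of the call.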
Adds tests/gateway/test_model_command_async_offload.py asserting the
offload contract at the real handler seam for both paths (mutation-
survivable: reverting either to_thread wrap fails the matching test).
Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
The Discord gateway heartbeat stalled ('Shard ID None heartbeat blocked
for more than N seconds') because _handoff_watcher polled the synchronous,
blocking SQLite-backed SessionDB directly on the asyncio event loop every
2s. Each list_pending/claim/complete/fail call performed blocking disk I/O
on the loop thread, starving the Discord heartbeat coroutine.
Wrap every blocking SessionDB call inside the watcher loop in
asyncio.to_thread(...) so the SQLite work runs on a worker thread and the
event loop (and heartbeat) stays responsive. These four call sites are the
only synchronous self._session_db.* calls inside the watcher loop body.
Adds tests/gateway/test_handoff_watcher_async_db.py asserting the watcher
offloads its SessionDB calls via asyncio.to_thread (mutation-survivable:
reverting any to_thread wrap fails the corresponding assertion).
Fixes #40695
Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
When context compaction's summary generation fails, the compressor's default
path (abort_on_summary_failure=False) drops the middle window and inserts a
static 'summary unavailable' marker — destroying the compacted turns. #29559
reported the field impact: a Connection error at the compaction moment dropped
124->15 messages (110 lost) for a long browser-automation task; #25585 is the
same failure mode (failed summary commits a destructive compaction anyway).
compress() already has an EXCEPTION to the historical drop default: auth
failures (401/403) ALWAYS abort and preserve the session, because rotating into
a placeholder-summary child on a broken credential strands the user. A transient
network/connection error is the same situation in reverse: it WILL recover, and
retrying then is strictly better than discarding context for a momentary blip.
Extend the always-abort carve-out to terminal connection/network failures:
- new _last_summary_network_failure flag, set in _generate_summary's terminal
failure branch when _is_connection_error(e) (reached only after any main-model
fallback is exhausted), reset alongside the auth flag;
- compress() aborts when it's set (returns messages unchanged,
_last_compress_aborted=True), independent of abort_on_summary_failure;
- a network-specific operator warning (distinct from the auth + config-flag
messages).
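The decision logic above, reduced to a sketch. The class and its fields are simplified, hypothetical stand-ins for the real compressor; summary generation is replaced by a `summary` argument that is `None` on failure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CompressorSketch:
    abort_on_summary_failure: bool = False
    last_summary_auth_failure: bool = False
    last_summary_network_failure: bool = False
    last_compress_aborted: bool = False

    def compress(self, messages: List[str], summary: Optional[str]) -> List[str]:
        self.last_compress_aborted = False
        if summary is not None:
            return [messages[0], summary, messages[-1]]
        # Auth and network failures always abort, regardless of the config flag.
        must_abort = (
            self.last_summary_auth_failure
            or self.last_summary_network_failure
            or self.abort_on_summary_failure
        )
        if must_abort:
            self.last_compress_aborted = True
            return messages  # unchanged: nothing is dropped
        # Historical default for other failures: drop the middle window.
        return [messages[0], "[summary unavailable]", messages[-1]]
```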
Scoped to connection errors only: a generic 500/400 still takes the historical
fallback-drop path (test_non_auth_failure_still_uses_fallback_path stays green).
Tests: network-failure detection + abort-despite-flag-false, both mutation-checked
(removing the flag-set fails detection; removing the carve-out fails the abort).
hermes-pr-review findings:
- notifyError('runtime-not-ready', msg) misused the (error, fallback) API:
the key became the notification body and the message became the title.
Switch to notify({ id, kind, title, message }) which puts content in the
right slots.
- The stable id 'runtime-not-ready' deduplicates: notify() replaces by id,
so repeated refreshOnboarding calls during an outage no longer stack
up to 4 persistent error toasts.
- Remove dead !state.manual guard from shouldPreserveConfiguredOnFallback:
refreshOnboarding already short-circuits on manual before the helper.
- Test: seed localStorage with '1' before asserting it survives (was testing
the wrong invariant — null in, null out).
- Test: use static import for spy instead of fragile await import.
- Test: add negative case for requested=true + configured=true (should
still downgrade — requested overrides preservation).
When shouldPreserveConfiguredOnFallback keeps configured=true, also call
notifyError('runtime-not-ready', ...) so the user knows the backend wasn't
verified instead of silently proceeding. Adapted from @mohamedorigami-jpg's
approach in PR #37634.
Regression coverage for the desktop gateway-restart hang: prompt_yes_no
returns its default when HERMES_NONINTERACTIVE=1 or on a bare EOFError
(closed/redirected stdin), and still exits on KeyboardInterrupt.
The dashboard/desktop spawn gateway actions with stdin=DEVNULL and
HERMES_NONINTERACTIVE=1 (hermes_cli/web_server.py), but prompt_yes_no
ignored that contract and called sys.exit(1) on the resulting EOFError.
On Windows, `gateway start` asks "Install it now so the gateway starts on
login? [Y/n]" when the scheduled task / startup entry is not yet
installed. Spawned from the desktop app there is no stdin to answer it, so
every desktop-triggered gateway restart aborted at that prompt and the
gateway never started ("Gateway service is not installed").
Fall back to the prompt's default when HERMES_NONINTERACTIVE is set, and
treat a bare EOFError as "accept default" rather than exiting. This lets
the Windows start path proceed unattended (Startup-folder fallback + direct
spawn) while interactive TTY usage is unchanged. Ctrl+C still exits.
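A minimal sketch of the resulting prompt behavior. The exact signature and the `== "1"` test are assumptions; the real `prompt_yes_no` may differ in detail.

```python
import os
import sys


def prompt_yes_no(question: str, default: bool = True) -> bool:
    """Ask a yes/no question, falling back to the default when unattended."""
    # Assumption: the spawner signals unattended mode with the value "1".
    if os.environ.get("HERMES_NONINTERACTIVE") == "1":
        return default
    suffix = "[Y/n]" if default else "[y/N]"
    try:
        answer = input(f"{question} {suffix} ").strip().lower()
    except EOFError:
        # Closed or redirected stdin: accept the default instead of exiting.
        return default
    except KeyboardInterrupt:
        # Ctrl+C still exits.
        sys.exit(1)
    if not answer:
        return default
    return answer in ("y", "yes")
```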
The wire contract said hop 1 uses "the agent's existing Nous Portal
access token" but didn't name WHICH of an agent's two identities that is.
A hosted agent never holds an `agent:{instanceId}` OAuth client (that
shape is minted only by the interactive dashboard auth-code grant); its
own outbound portal calls use the bootstrap-session token (client
`hermes-cli-vps`) planted in auth.json on first boot. NAS must resolve
the instance id from either an `agent:{id}` client OR the bootstrap
session (AgentInstance.bootstrapSessionId), not gate on `agent:*` alone —
which 403'd every real hosted-agent provision in prod.
Documents the NAS-side fix (resolveAgentCronInstanceId) so the contract
and the implementation agree.
Phase 7 Unit 7d-B. When an operator opts an instance OUT of the Team Gateway
relay (Unit 7b deprovision), the connector revokes the per-gateway secret and
closes the gateway's WS with 4401. The reconnect supervisor previously treated
EVERY close as retryable, so the live process spun "retrying 4401" forever and
the dashboard showed a red error — opt-out looked like a failure.
Now a 4401 close that arrives AFTER a successful handshake is recognized as a
terminal credential revocation:
- ws_transport.py: track `_handshake_succeeded` (set when a descriptor is
received); on a 4401 close after a prior success, latch `auth_revoked` and do
NOT spawn the reconnect supervisor. A 4401 BEFORE any successful handshake
stays retryable (cold-start / not-yet-provisioned race, not a revocation).
New `auth_revoked` property + a websockets-version-safe close-code reader
(prefers `.rcvd`/`.sent` Close frames; `.code` is deprecated in websockets 13+).
- adapter.py: a revocation monitor turns `transport.auth_revoked` into a clean,
NON-retryable `relay_disabled` fatal and notifies the gateway's fatal-error
handler (so the adapter is removed and NOT queued for reconnection — the
credential is dead until the instance is recreated). Monitor is cancelled on
disconnect; only started when the transport exposes `auth_revoked` (prod WS).
- run.py: `_handle_adapter_fatal_error` maps the `relay_disabled` code to a
`disabled` platform_state (not `fatal`/`retrying`).
- web: PlatformsCard renders the `disabled` state with a neutral outline badge,
a PowerOff icon, and muted (not destructive-red) text + message. New optional
`status.disabled` i18n string ("Disabled").
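The transport-side rule can be sketched as a small state machine. The class and method names are hypothetical; the 4401 code and the before/after-handshake distinction come from this change.

```python
AUTH_REVOKED_CLOSE_CODE = 4401


class TransportSketch:
    def __init__(self) -> None:
        self.handshake_succeeded = False
        self.auth_revoked = False

    def on_descriptor_received(self) -> None:
        # Receiving a descriptor marks the handshake as successful.
        self.handshake_succeeded = True

    def on_close(self, code: int) -> bool:
        """Return True when the reconnect supervisor should be spawned."""
        if code == AUTH_REVOKED_CLOSE_CODE and self.handshake_succeeded:
            # Terminal: the credential was revoked after it had worked once.
            self.auth_revoked = True
            return False
        # Everything else, including a cold-start 4401, stays retryable.
        return True
```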
Also bundles the Phase 7 contract-doc update (this doc is authoritative in
hermes-agent): docs/relay-connector-contract.md gains an "Author-first
resolution + the account-link (DM) path" section documenting the
multi-tenant-guild rule (D-7.2 — route by authenticated author binding, never by
guild; unlinked → fail-closed), the `/link <code>` DM flow, and the
connector-authoritative opt-out + terminal-4401 behavior this PR implements.
Tests: +2 ws_transport (4401-after-handshake terminal / no-reconnect;
4401-before-handshake stays retryable) and +2 adapter (revocation → non-retryable
relay_disabled fatal + handler fired; no-revocation → no fatal). 138 relay tests
pass (incl. the contract-doc conformance test); ruff clean; web tsc clean.
Phase 7 Unit 7d-B (relay-adapter solo lane). Q17 → Option 2; Option 3 (live
de-register, no recreate) + the restart-re-provision hole deferred post-alpha.
After `hermes update`, a globally-installed agent-browser's npm postinstall
(fixUnixSymlink) re-points the global symlink (e.g. /opt/homebrew/bin/agent-browser)
at our local node_modules binary. The next update wipes node_modules, leaving a
dangling symlink that `which` still reports but exec fails on with exit 127 —
silently breaking every browser tool (#48521).
Root cause is trust-on-presence: shutil.which/Path.exists accept a name that
resolves but won't run. Add hermes_constants.agent_browser_runnable() (resolves
the path + runs --version) and gate all four resolution sites on it:
_find_agent_browser now skips a dead candidate and falls through to the next
working one (extended PATH -> local .bin -> npx), self-healing the dangling link.
dep_ensure/doctor/nous_subscription validate too; doctor warns on a broken link.
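A sketch of a "runs, not just resolves" probe in the spirit of `agent_browser_runnable()`. The function name, timeout, and exit-code check here are assumptions, not the actual implementation.

```python
import shutil
import subprocess
import sys


def binary_runnable(name_or_path: str, timeout: float = 5.0) -> bool:
    """Return True only if the binary resolves AND `--version` exits 0."""
    resolved = shutil.which(name_or_path)
    if resolved is None:
        return False
    try:
        result = subprocess.run(
            [resolved, "--version"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.SubprocessError):
        # Covers a missing target, a bad interpreter line, and timeouts.
        return False
    return result.returncode == 0
```

A resolver can then walk its candidates in order and keep the first one for which the probe succeeds.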
Closes #48521.
* docs: stop recommending pip install hermes-agent; point to install script
The install script is the only supported install path (it provisions a
managed, isolated uv environment). Replace bare `pip install hermes-agent`
primary-install recommendations with the curl install script, and rewrite
optional-extra snippets (`pip install "hermes-agent[X]"`) to the managed-env
form `cd ~/.hermes/hermes-agent && uv pip install -e ".[X]"` that matches the
installer and the English quickstart.
Covers English docs + zh-Hans mirrors, the achievements plugin README, and
realigns the zh-Hans quickstart to the English Desktop-installer-first layout
(dropping its stale "Method A — pip (simplest)" section).
* docs: drop pip as a supported install/update method
Removes the 'pip installs' supported-method sections from updating.md and
cli-commands.md (EN + zh-Hans): the curl install script is the only supported
way to install/update the Hermes CLI. The _cmd_update_pip pip/pipx branches
remain in code as an undocumented safety net for users who already have such an
install, but the docs no longer advertise pip as a path.
Also normalizes a bare `pip install -e '.[acp]'` to the managed-env form.
Leaves python-library.md untouched: importing AIAgent as a library dependency
into your own project is a distinct use case where pip is correct.
On a launchd-managed gateway (macOS), /restart stopped the gateway but
never relaunched it: the handler's service detection checks only
INVOCATION_ID (systemd) and container markers, so under launchd it takes
the detached path and exits 0 — which KeepAlive.SuccessfulExit=false
treats as a deliberate stop. The gateway stays silently dead until a
manual launchctl kickstart.
Detect launchd via XPC_SERVICE_NAME, which launchd sets to the job label
for processes it spawns. The probe deliberately excludes the literal
"0": interactive macOS shells inherit XPC_SERVICE_NAME=0 (a truthy
string), and routing an unsupervised interactive gateway to the service
path would make it exit non-zero with nothing to revive it.
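The probe can be sketched as below. The function name and the example job label in the usage note are hypothetical; the `XPC_SERVICE_NAME` variable and the `"0"` exclusion come from this change.

```python
import os
from typing import Mapping, Optional


def running_under_launchd(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when XPC_SERVICE_NAME holds a real launchd job label."""
    env = os.environ if env is None else env
    label = env.get("XPC_SERVICE_NAME", "")
    # "0" is what interactive macOS shells inherit, so it must not count.
    return bool(label) and label != "0"
```

For example, `running_under_launchd({"XPC_SERVICE_NAME": "com.example.gateway"})` is true, while an environment with `XPC_SERVICE_NAME=0` or no such variable is treated as unsupervised.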
Routing through via_service=True (rather than forcing a non-zero exit
on the detached path) matters: the detached path also spawns a helper
that relaunches the gateway, so exiting non-zero there would have BOTH
the helper and launchd respawn it — two gateways racing for the same
bot tokens. The service path spawns no helper; launchd is the single
respawner.
Fixes #43475. Supersedes the run.py-era probes in #19940/#33393 (the
handler has since moved to gateway/slash_commands.py) and avoids the
double-spawn risk in the exit-code-site approaches (#43498, #43596).
The dashboard chat sidebar's tool-call activity card was disabled in the
product — both ChatPage mounts passed showTools={false} (since #49077),
so the box never rendered. The sidebar still subscribed to tool.* events
and accumulated them in state for a panel nobody saw.
Remove the tools card, the showTools prop, the tool.* event handling and
state, and the now-orphaned ToolCall component. The /api/events
subscription stays for session.info (live title) and
dashboard.new_session_requested. The sidebar is now just the model
selector box; the session list (ChatSessionList) is unchanged.
No behavior change in the live dashboard — the tools box was already
hidden.
Add two regression tests for the salvaged #48706 fix:
- login token exchange targets platform.claude.com first
- falls back to console.anthropic.com when the new host is unreachable
Also map the salvaged contributor's noreply email in release.py
AUTHOR_MAP (CI author-map gate).
Anthropic migrated the OAuth token endpoint from
console.anthropic.com/v1/oauth/token (now returns HTTP 404) to
platform.claude.com/v1/oauth/token. The token *refresh* path already
iterated both hosts, but the two initial code-exchange call sites were
hardcoded to the dead console host, so every new Claude OAuth login
failed with 'Token exchange failed: HTTP Error 404: Not Found' and saved
no credentials.
Fix the whole bug class:
- Add _OAUTH_TOKEN_URLS [platform.claude.com, console.anthropic.com] in
agent/anthropic_adapter.py; _OAUTH_TOKEN_URL now points at the live
host for backward-compat with existing imports.
- run_hermes_oauth_login_pure() (CLI flow) iterates the list, first
success wins, mirroring the refresh path.
- hermes_cli/web_server.py (desktop dashboard flow) imports the list and
iterates it too, so the GUI login path is fixed identically.
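A sketch of the first-success-wins iteration. The `https://` scheme, the function names, and the injected `post` callable are assumptions made so the example runs without a network; the fake transport simulates the new host being unreachable.

```python
from typing import Callable, Dict, List

# Host order from the commit text; the scheme is assumed.
OAUTH_TOKEN_URLS: List[str] = [
    "https://platform.claude.com/v1/oauth/token",
    "https://console.anthropic.com/v1/oauth/token",
]


def exchange_code(
    payload: Dict[str, str],
    post: Callable[[str, Dict[str, str]], Dict[str, str]],
    urls: List[str] = OAUTH_TOKEN_URLS,
) -> Dict[str, str]:
    """Try each token host in order and return the first successful response."""
    last_error: Exception = RuntimeError("no token URLs configured")
    for url in urls:
        try:
            return post(url, payload)
        except Exception as exc:  # broad on purpose: any failure means "try next"
            last_error = exc
    raise RuntimeError(f"Token exchange failed: {last_error}")


def fake_post(url: str, payload: Dict[str, str]) -> Dict[str, str]:
    """Hypothetical transport: the first host is unreachable, the second works."""
    if url == OAUTH_TOKEN_URLS[0]:
        raise ConnectionError("host unreachable")
    return {"access_token": "example-token", "host": url}


tokens = exchange_code({"code": "example-code"}, fake_post)
```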
Probe: console.anthropic.com/v1/oauth/token -> HTTP 404 (gone),
platform.claude.com/v1/oauth/token -> HTTP 400 (alive). Verified a real
Claude MAX OAuth login now succeeds end-to-end.
Most Matrix clients auto-set a room name when creating a DM (e.g.
"Alice & Bot" from participant display names), so the old
`is_direct and not has_explicit_name` heuristic classified virtually
all client-created DM rooms as "room", forcing require_mention gating
in legitimate one-on-one DMs.
member_count is now the primary DM signal: <=2 members means the room
is necessarily a 1:1 conversation, regardless of m.direct or an explicit
name. A room that grew to 3+ members but is still in stale m.direct is
still classified as a room (conflict flag set). Falls back to the
m.direct + name heuristic when the count is unavailable.
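The classification order can be sketched as follows. The function name and return values are illustrative, the conflict flag is omitted, and "count unavailable" is modeled as `None`, which is an assumption.

```python
from typing import Optional


def classify_room(
    member_count: Optional[int],
    is_direct: bool,
    has_explicit_name: bool,
) -> str:
    """Return "dm" or "room", using member count as the primary signal."""
    if member_count is not None:
        # Two or fewer members is a 1:1 conversation regardless of name;
        # three or more is a room even if m.direct is stale.
        return "dm" if member_count <= 2 else "room"
    # Fallback when the count is unavailable: the older heuristic.
    return "dm" if (is_direct and not has_explicit_name) else "room"
```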
Also hardens _get_room_member_count with a joined_members API fallback
when the cache-backed state_store is empty.
Salvaged from #48554 by @justemu onto the current plugin adapter path
(gateway/platforms/matrix.py -> plugins/platforms/matrix/adapter.py).
Fixes #48551
Users who inspect ~/.hermes/sessions/sessions.json see only gateway entries
(e.g. agent:main:whatsapp:dm:...) and mistake it for the session index that
hermes sessions list / /sessions read — which is actually state.db. Issue
#49361 reported CLI sessions as 'invisible' on this premise.
- gateway/session.py: write a self-documenting _README sentinel at the top of
sessions.json explaining it's the gateway routing index and that ALL sessions
(CLI/TUI/gateway) live in state.db; skip _-prefixed keys on load so the
sentinel never round-trips into a SessionEntry.
- Harden every sessions.json reader against the sentinel: mcp_serve loader,
gateway/mirror.py, gateway/channel_directory.py all skip _-prefixed keys.
- docs/user-guide/sessions.md: warning callout naming the exact symptom.
- tests: assert prune ignores metadata sentinels; add round-trip coverage.
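The reader-side hardening amounts to filtering metadata keys on load. A minimal sketch, with a hypothetical loader name and made-up example data:

```python
import json
from typing import Any, Dict


def load_session_entries(raw_json: str) -> Dict[str, Any]:
    """Parse sessions.json content, skipping "_"-prefixed metadata keys."""
    data = json.loads(raw_json)
    return {key: value for key, value in data.items() if not key.startswith("_")}


# Example file content; the key and values are illustrative only.
raw = json.dumps(
    {
        "_README": "Gateway routing index. All sessions live in state.db.",
        "agent:main:whatsapp:dm:example": {"session_id": "abc123"},
    }
)
entries = load_session_entries(raw)
```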
Component button interactions (approve/deny, slash confirm, model
picker, clarify) were not checking the pairing store for authorization.
Users approved via `hermes pairing approve` could send messages and use
slash commands (which go through the gateway authz_mixin), but button
clicks were rejected because `_component_check_auth` only checked
env-var allowlists (DISCORD_ALLOWED_USERS, GATEWAY_ALLOW_ALL_USERS,
etc.) and not the pairing store.
This was a regression from commit f6f363662 which intentionally made
component auth fail-closed when no allowlist is set (security fix for
GHSA-mc26-p6fw-7pp6), but did not account for pairing-based auth.
Fix: add a `PairingStore.is_approved("discord", uid)` check to
`_component_check_auth`, mirroring `authz_mixin._check_authorization`.
The pairing store check runs after all allowlist checks, preserving the
fail-closed behavior for non-paired, non-allowed users.
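The check order can be sketched as below. The signature is hypothetical and the pairing lookup is injected as a callable so the example stays self-contained; the real code calls `PairingStore.is_approved("discord", uid)`.

```python
from typing import Callable, Set


def component_check_auth(
    user_id: str,
    allowed_users: Set[str],
    allow_all: bool,
    is_paired: Callable[[str], bool],
) -> bool:
    """Allowlist checks first, then the pairing store; otherwise fail closed."""
    if allow_all or user_id in allowed_users:
        return True
    # Last resort: a user approved via pairing is authorized too.
    return is_paired(user_id)
```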
Fixes #50627
Regression coverage for the synthetic-assistant close: interrupt after a
successful tool must persist an assistant tail (placeholder when no
delivered text), real delivered text is preserved, and non-interrupted
or non-tool tails are left untouched.
Asserts resolve_runtime_provider honors target_model over the stale
persisted model.default when choosing the Bedrock dual-path api_mode:
Claude target -> anthropic_messages, Nova target -> bedrock_converse.
Both fail without the #49095 fix.
The 30-slot default could not fit Hermes's ~50 built-in commands, so
all skill commands (and 20 built-ins) were silently dropped from the
Telegram `/` menu by default; they only worked when typed manually.
Raising the default to 60 keeps all built-ins plus common skill commands
visible out of the box while staying under Telegram's ~4KB payload limit.
Users can still tune it via platforms.telegram.extra.command_menu.
Adds a configurable Telegram BotCommand menu cap and priority list via
platforms.telegram.extra.command_menu (max_commands clamped 1..100;
priority_mode prepend|append|replace). Default cap stays 30; hidden
commands remain invokable when typed and /commands lists the full set.
Salvaged from PR #42021. Cherry-picked onto current main; the original
edited gateway/platforms/telegram.py, now relocated to
plugins/platforms/telegram/adapter.py.
The desktop composer threw an uncaught "Composer is not available" at
startup and the input went unresponsive (#49903). assistant-ui's composer
mutators (setText/send/…) throw when the thread's composer core isn't bound
yet; the read path is null-safe but the writes are not. ChatBar pushes draft
text via aui.composer().setText() from mount-time effects (draft restore,
clearDraft, external inserts), and the v0.17.0 popout refactor (#49488)
widened the unbound window by moving the composer out of the contain wrapper
into a sibling of the thread — so the throw surfaced as an uncaught error
that wedged the input.
Wrap every composer mutation in a setComposerText helper that swallows the
unbound-core throw. The contentEditable DOM + draftRef already hold the text
and the draft-editor sync re-applies it once the core attaches, so the draft
is never lost — only the premature state push is skipped.
Selective --clone / --clone-from / --clone-config copied .env but not
auth.json, silently dropping the credential pool — including OAuth tokens
(Anthropic `claude /login`, Codex, xAI) that never land in .env. A profile
cloned from an OAuth-authenticated default therefore resolved a different
provider (or none) than the source under provider: auto. --clone-all already
carried auth.json via the full copytree; only the selective path missed it.
Add auth.json to _CLONE_CONFIG_FILES and tighten it to 0o600 after copy,
matching .env semantics.
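A sketch of the copy-then-tighten step. The helper name is hypothetical, and the demo uses temporary directories with placeholder content rather than a real profile.

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path


def clone_secret_file(src: Path, dst_dir: Path) -> Path:
    """Copy a credential file and restrict it to owner read/write."""
    dst = dst_dir / src.name
    shutil.copy2(src, dst)
    # On POSIX this yields 0o600; on Windows chmod only toggles read-only.
    os.chmod(dst, 0o600)
    return dst


with tempfile.TemporaryDirectory() as tmp:
    source_dir = Path(tmp) / "source"
    clone_dir = Path(tmp) / "clone"
    source_dir.mkdir()
    clone_dir.mkdir()
    (source_dir / "auth.json").write_text("{}")
    copied = clone_secret_file(source_dir / "auth.json", clone_dir)
    copied_text = copied.read_text()
    copied_mode = stat.S_IMODE(copied.stat().st_mode)
```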
On Windows, start_server() served uvicorn via a bare asyncio.run(_serve()),
which uses the default ProactorEventLoop. uvicorn's socket-serving stack
assumes a SelectorEventLoop on win32 (uvicorn/loops/asyncio.py forces it, and
uvicorn.Server.run threads config.get_loop_factory() into its runner for
exactly this reason). Driving uvicorn on the proactor loop makes
server.startup() bind a socket that never accepts: the dashboard and desktop
backend print "Skipping web UI build" then hang forever with the port
LISTENING but no TCP handshake completing.
Fix is win32-scoped to keep the blast radius minimal: POSIX keeps the exact
asyncio.run(_serve()) it had (its default loop is already a SelectorEventLoop /
uvloop, which is what uvicorn serves on). Only on Windows do we mirror
uvicorn.Server.run and run on the loop factory uvicorn picks, with a fallback
to WindowsSelectorEventLoopPolicy for uvicorn < 0.36.
Fixes hermes dashboard and hermes desktop (the Electron app spawns a
hermes dashboard backend). The gateway symptom in the report has a separate
root cause (no uvicorn) and is not addressed here.
uperLu's #50958 renamed plugins/cron → plugins/cron_providers but left
two test files patching the now-gone plugins.cron.chronos.verify path,
which would fail collection. Point them at plugins.cron_providers.*.
Add uperLu to release.py AUTHOR_MAP.
The dashboard Profiles view showed "Gateway stopped" for a gateway that
is in fact running — while the sidebar status strip and `hermes gateway
status` (CLI) both correctly showed it running. Reported on v0.17.0
running the gateway + dashboard in one Docker container.
Root cause: three liveness surfaces with three detection strengths, all
reading the same `gateway.pid`:
- `hermes gateway status` -> find_gateway_pids() (process-table scan)
- sidebar /api/status -> get_running_pid() + gateway_state.json PID
fallback + health-URL probe
- Profiles view -> _check_gateway_running() = get_running_pid()
ONLY, no fallback
`get_running_pid()` short-circuits to None the moment the runtime lock
(`gateway.lock`) doesn't register as held by the *calling* process —
which is always true when the reader is a separate process from the
gateway (the dashboard is its own s6 service in the container), and also
for any launch-service-managed gateway that left a fresh
`gateway_state.json` but no live PID file. So the Profiles view alone
reported the live gateway as stopped.
Fix: give _check_gateway_running the same fallback the sidebar already
has — after the pid-file/lock check misses, validate the PID recorded in
that profile's gateway_state.json against the live process table via the
existing get_runtime_status_running_pid(). read_runtime_status() gains an
optional path arg so a profile's state file can be read without mutating
the process-global HERMES_HOME (preserving the contextvar-based profile
isolation the dashboard relies on). Backward compatible: every existing
caller passes no argument.
Tests: a regression test that fails pre-fix (live gateway, lock check
returns None -> must still report running) and a guard test that a
'stopped' state file is never reported running even with a live PID.