The gateway pre-compression hygiene valve force-compressed any session
crossing 400 messages regardless of token usage. On large-context (1M+)
models doing many short, message-dense turns, a healthy session at ~16%
token usage could hit 400 messages and get force-compressed — and the
compression summary's stale Active Task could then bleed into the next
turn.
The valve's actual purpose is to break a death spiral: when API calls
keep disconnecting on an oversized session, no token-usage data arrives,
the token threshold never fires, and the transcript grows unbounded.
It's a count-based floor for that pathological case only. 400 was tuned
for ~200K-context models and is far too low for modern large-context
sessions. Raise the default to 5000 — still well clear of any death
spiral, but no longer firing on legitimate long conversations.
The value remains fully configurable via compression.hygiene_hard_message_limit.
The OpenAI-compatible API server only enforced a hardcoded cap of 10
concurrent runs on /v1/runs, leaving /v1/chat/completions and
/v1/responses unbounded — a request flood could exhaust CPU, memory,
and upstream LLM quota (#7483).
- Add gateway.api_server.max_concurrent_runs (config.yaml, default 10,
0 disables). No env var.
- Shared concurrency gate across all three agent-serving endpoints,
counting both the chat/responses in-flight counter and the /v1/runs
stream set. Returns OpenAI-style 429 + Retry-After when at the cap.
- Remove the dead hardcoded _MAX_CONCURRENT_RUNS class attribute.
Closes#7483.
Second cleanup pass (simplify-code review of the first follow-up):
- write_runtime_status now clamps active_agents via parse_active_agents
instead of an inline max(0, int(...)). Removes the duplicated clamp the
helper's docstring acknowledged AND closes a write-side ValueError gap
(a non-numeric active_agents previously raised; now degrades to 0).
- hermes_cli/gateway.py draining-status line routes its active-agents count
through parse_active_agents too — the third coercion site of the same
persisted field, now consistent and non-raising with the two HTTP surfaces.
- web_server.py /api/status: the drain-timeout resolver fallback now catches
ImportError specifically and falls back to DEFAULT_GATEWAY_RESTART_DRAIN_TIMEOUT
(a real float) instead of a blanket 'except Exception -> None'. None would
have violated the surfaced field's int/float contract and stripped NAS's
poll-deadline hint silently.
- Dropped a redundant 'if runtime else 0' branch (parse_active_agents already
handles the empty/None case) and tightened the parse_active_agents docstring
to describe the actual single-contract role (write + both reads).
Follow-up cleanups on top of the busy/idle readout (PR #50103):
- web_server.py /api/status reused the single drain-timeout resolver
hermes_cli.gateway._get_restart_drain_timeout() (HERMES_RESTART_DRAIN_TIMEOUT
env -> agent.restart_drain_timeout config -> default) instead of inlining a
third hand-rolled copy of that precedence chain. Also fixes a subtle
divergence: the inline copy used os.environ.get() so a set-but-empty env var
was treated as a value rather than falling through to config; the shared
resolver .strip()s and falls through correctly.
- Added gateway.status.parse_active_agents() and routed BOTH HTTP surfaces
(/api/status and /health/detailed) through it, so the exposed active_agents
field is consistently clamped non-negative. Previously /api/status clamped
while /health/detailed exposed the raw file value, diverging on a corrupt
count.
- Added TestParseActiveAgents covering the shared coercion contract.
Give an external consumer (NAS) a trustworthy, always-reachable busy/idle
readout it can poll before a disruptive lifecycle action (restart,
migrate, stop, auto-update). The dashboard /api/status is the only HTTP
surface guaranteed up on a hosted agent regardless of which gateway
platforms are enabled, and it already reads gateway_state.json.
Add to /api/status (additive, non-breaking):
- active_agents — in-flight gateway-turn count (now refreshed
per-turn by the companion gateway-side commit)
- gateway_busy — running AND active_agents > 0
- gateway_drainable — running and live (a valid begin-drain target)
- restart_drain_timeout — resolved seconds, so the consumer can size its
poll deadline without out-of-band knowledge
(env HERMES_RESTART_DRAIN_TIMEOUT → config
agent.restart_drain_timeout → default)
The busy/drainable contract is defined once in gateway.status
(derive_gateway_busy / derive_gateway_drainable) and consumed by both
/api/status and /health/detailed so the two surfaces can never disagree.
Liveness keys off gateway_running (a live PID/health probe), NEVER
gateway_updated_at — a healthy idle gateway never advances that timestamp.
All derived fields degrade to safe falsy values when the gateway is down
or the status file is absent/corrupt (never a spurious "busy" that would
wedge the consumer). active_sessions (the 5-min DB recency heuristic the
SPA reads) is left exactly as-is — new signal, new fields.
Tests (behaviour contracts, not snapshots): the pure derivation contract
across every running/state/count/liveness combination; /api/status
integration for busy, idle-drainable, draining, down, stale-busy-file,
corrupt-count, and timeout surfacing; and /health/detailed parity.
Follow-up to the salvaged preflight-compression warning:
- Replace silent `except Exception: pass` at all 5 guard call sites
(cli.py x2, gateway/slash_commands.py x2, tui_gateway/server.py) with
`logger.debug(...)` so signature drift in the guard helper isn't hidden.
- tui_gateway/server.py: set the confirm dict's `warning` field to the
merged message (was bare expensive-model text) so it matches
`confirm_message` for any future consumer reading `warning`.
- Add trailing newlines to the two new files.
Adds hermes_cli/context_switch_guard.py mirroring the model_cost_guard
pattern. When a user switches models mid-session (Herm TUI picker, CLI,
or /model on Telegram/Discord), the warning surfaces on the existing
ModelSwitchResult.warning_message path used by the expensive-model
guard if the new model's compression threshold is below the current
session size.
Partial fix for #23767 — addresses only the 'user-facing guardrail
when switching from a high-context provider to a substantially
lower-context provider' slice. The other proposed fixes from that
issue (hard preflight token guard, metadata cache invalidation on
switch, compression safety invariant, oversized tool-output handling)
are out of scope for this PR.
`cronjob(action='run')` (and `hermes cron run`) only set `next_run_at = now`
and returned success, relying on the scheduler ticker to actually execute the
job on its next tick. When no gateway/ticker is running — a CLI-only setup, or
the Windows case in #41037 — the job never executed: `run` reported success,
but `last_run_at` stayed null forever, no output, no delivery.
A manual `run` should actually run. `_execute_job_now` now:
- **claims the job via `claim_job_for_fire`** — the same at-most-once CAS the
scheduler/external-provider fire path uses. This both advances `next_run_at`
for recurring jobs and blocks a concurrently-running gateway ticker from
double-firing the same job; if the claim is lost, the run is skipped (the
tool reports `execution_skipped`). This closes the double-fire race that a
bare `advance_next_run` left open (a tick whose `get_due_jobs` already
captured the job between trigger and advance would still fire it).
- **delegates firing to `run_one_job`** — the single shared
execute→save→deliver→mark body the ticker and external providers use — so
failure delivery, `[SILENT]` handling, and live-adapter delivery stay
identical across paths and can't drift. (The original salvage re-implemented
this sequence inline and had already dropped failure delivery + `[SILENT]`.)
The tool response carries `executed`, `execution_success`, and either
`execution_error` or `execution_skipped`. The `hermes cron run` CLI message no
longer claims "It will run on the next scheduler tick" — it reports the actual
"Ran now: succeeded/failed" outcome (or the skip).
Salvaged from #41130 by @kyssta-exe (authorship preserved); reworked to reuse
`claim_job_for_fire` + `run_one_job` per review rather than re-implementing the
fire sequence inline. Adds tests for the claim-then-fire path, claim-lost skip,
failure reporting, and exception capture.
Fixes#41037
Co-authored-by: kyssta-exe <kyssta-exe@users.noreply.github.com>
The in-process cron ticker (cron/scheduler_provider.py) caught only
`Exception` and logged at DEBUG, so a `SystemExit`/`KeyboardInterrupt`
raised from a misbehaving provider SDK or agent retry path killed the
ticker thread silently. The gateway PROCESS stayed up, so `hermes cron
status` — which only checks `find_gateway_pids()` — kept reporting
"✓ jobs will fire automatically" while no jobs ever fired (#32612,
#32895).
This makes ticker death survivable and detectable:
- The ticker loop now catches `BaseException` and logs at ERROR with a
traceback, so a single bad tick no longer tears the thread down and
the failure is visible in the gateway log.
- The loop records a heartbeat (`cron/ticker_heartbeat`, epoch seconds)
on startup and after every tick — best-effort, never raised into the
loop. Both ticker entry points (the gateway and the desktop fallback
in web_server.py) funnel through `InProcessCronScheduler.start`, so one
heartbeat site covers both.
- `hermes cron status` now reads the heartbeat age: if the gateway is
running but the heartbeat is stale (> 200s, i.e. several missed ~60s
ticks), it reports the ticker as STALLED and suggests a restart instead
of falsely claiming jobs will fire. A missing heartbeat (older build /
never ran) is treated as "unknown", not "dead".
Adds tests for BaseException survival, per-iteration heartbeat recording,
heartbeat round-trip/age, staleness detection, and silent-write-failure.
Salvaged from #49660 (BaseException survival on current structure),
extended with the heartbeat + honest-status reporting that the earlier
(pre-refactor) watchdog PRs #35616 and #33849 proposed.
Fixes#32612Fixes#32895
Co-authored-by: banditburai <promptsiren@gmail.com>
Co-authored-by: sweetcornna <96944678+sweetcornna@users.noreply.github.com>
The terminal backend onboarding step pointed at
/docs/developer-guide/environments, which no longer exists. Point it at
the live docs page /docs/user-guide/configuration#terminal-backend-configuration.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
resolve_spotify_runtime_credentials() called _refresh_spotify_oauth_state()
without a try/except, so a terminal failure (HTTP 400/401, invalid_grant,
refresh_token_reused) raised AuthError but left the dead refresh_token in
auth.json. Every subsequent session re-read and retried the same token over
the network, failing identically each time.
Fix: wrap the refresh call and, when exc.relogin_required is True and a
refresh_token is present, clear the dead OAuth fields (access_token,
refresh_token, expires_at, expires_in, obtained_at) and write a
last_auth_error quarantine marker to auth.json before re-raising. The next
call sees no access_token and fails fast with spotify_access_token_missing —
no network retry — and the user is prompted to re-authenticate.
Mirrors the quarantine pattern already in place for Nous, xAI-OAuth,
Codex-OAuth (#28116, #28118), and MiniMax-OAuth (#28119).
Three state-loss bugs at the compression rotation boundary, fixed together
because they all live in the same ~80-line rotation block:
- #33618: a persistent /goal did not follow the rotation. load_goal does a
flat per-session lookup with no lineage walk, so a goal silently died when
compression minted a fresh child id. Added migrate_goal_to_session() and
call it after the child session is created (move-not-copy: the parent row
is archived as cleared so exactly one active goal row exists).
- #33906/#33907: if the child create_session raised (FK constraint,
contended write), the outer handler only warned and let the agent continue
on the NEW id — which has no row in state.db — producing an orphan session.
Now the rotation rolls agent.session_id back to the still-indexed parent
(reopening it) instead of stranding the conversation on a phantom id.
- #27633: the compaction-boundary on_session_start notification omitted the
platform kwarg, so context-engine plugins saw source=unknown for every
message after the boundary. Forward platform (matching the initial
session-start call in agent_init.py).
Co-authored-by: denisqq <21260182+denisqq@users.noreply.github.com>
Co-authored-by: zccyman <16263913+zccyman@users.noreply.github.com>
Co-authored-by: liuhao1024 <sunsky.lau@gmail.com>
doctor's npm audit hardcoded PROJECT_ROOT/scripts/whatsapp-bridge. In
read-only Docker installs the bridge deps live in the writable HERMES_HOME
mirror (#49561), so node_modules was never found there and the bridge audit
silently skipped. Resolve the dir through the shared
resolve_whatsapp_bridge_dir() helper so doctor audits where deps actually
install. Falls back to the install-tree path if the helper is unavailable.
Follow-up to the salvaged worktree-materialization fix. When a worktree
task has no explicit workspace_path, resolve the anchor from the board's
default_workdir (a git repo) and materialize <repo>/.worktrees/<id> per
task, instead of silently rooting under the dispatcher's CWD (whatever
directory launched the gateway, e.g. the Hermes checkout). If no
default_workdir is configured, raise with a clear message rather than
guessing from CWD.
Adds AUTHOR_MAP entry for the salvaged commit.
The dispatcher treated workspace_kind=worktree as metadata only and never
ran 'git worktree add', so every worktree task ran in the main repo checkout
instead of an isolated worktree — concurrent tasks silently shared one tree
and contaminated each other.
This materializes a real linked worktree at <repo>/.worktrees/<task_id> on
branch wt/<task_id> when resolve_workspace() handles a worktree task, treats a
repo-root workspace_path as shorthand for that location, persists the derived
workspace/branch back onto the task row, and — on rerun/redispatch — detects an
already-materialized linked worktree (via git-common-dir) and reuses it instead
of nesting a second .worktrees/<id> inside it.
In Docker the install tree (/opt/hermes) is read-only, so npm install for
the WhatsApp bridge fails with EACCES. Add resolve_whatsapp_bridge_dir() in
whatsapp_common.py: when the install dir is read-only, mirror the bridge
source into a writable HERMES_HOME location and use that. Both the
adapter and the 'hermes whatsapp' CLI resolve through the shared helper so
the install and runtime paths agree.
Fixes#49561
Review nit (yoniebans): the config.py comment still said compaction is
'lossy: the pre-compaction transcript is discarded, matching Claude Code /
Codex' — leftover from the original destructive design. The shipped behavior
is soft-archive: lossy for the LIVE context (what the model reloads), but the
pre-compaction turns are kept on disk (active=0, compacted=1), searchable via
session_search and recoverable. Comment now says so. Comment-only; no behavior
change.
Context compression today rewrites the message list AND rotates the
session id — it ends the session, forks a parent_session_id child, and
renumbers the title (name -> name #2). That moving identity key is the
root cause of a whole bug cluster: /goal lost (#33618), pending response
lost at the split (#14238), orphan sessions (#33907), TUI sid desync
(#36777), FTS search gaps + duplicate sidebar entries (#45117), null
continuation cwd (#42228), and title-rename dead-ends (#48989). It also
forced a large defensive apparatus (compression lock, contextvar/env/
logging triple-sync, orphan finalization, gateway SessionEntry
re-propagation, tip projection) whose only job is surviving a
mid-conversation id change.
Add a compression.in_place config flag (default False during rollout).
When True, compaction rewrites the transcript and rebuilds the system
prompt but keeps the SAME session_id: no end_session, no child row, no
title renumber, no contextvar/logging re-sync, no memory/context-engine
session-switch. The conversation keeps one durable id for life, like
Claude Code / Codex. Compaction is lossy by design — the pre-compaction
transcript is summarized away, not archived.
The rotation path is unchanged when the flag is off (moved verbatim into
an else branch). Staged rollout: this PR ships the option behind a
default-off flag for live validation; a follow-up flips the default and
deletes the now-redundant rotation machinery, superseding the 14 open
band-aid PRs in this area.
- hermes_cli/config.py: add compression.in_place (default False), documented
- agent/agent_init.py: resolve the flag -> agent.compression_in_place
- agent/conversation_compression.py: branch compress_context() on the flag
- tests/run_agent/test_in_place_compaction.py: in-place invariants +
rotation regression guard + config default
The pre-flush of current-turn messages (#47202) runs in BOTH modes, so no
boundary data loss. Prompt-cache invariant preserved: the system-prompt
rebuild is the same single sanctioned invalidation that already happens
during compaction — no NEW invalidation. Message alternation preserved.
A nous inference_base_url that fails the host allowlist (e.g. a stale
stg-inference-api.nousresearch.com persisted before the allowlist
existed) was only replaced 'if refreshed_url:' — so when the validator
rejected the URL it left the poisoned value in place. The 'falling back
to default' warning fired but never took effect: every subsequent call,
including the auxiliary compression call, kept hitting the dead staging
endpoint and 401'd.
Reset to DEFAULT_NOUS_INFERENCE_URL when validation returns None at both
refresh sites in resolve_nous_runtime_credentials, so a poisoned
auth.json self-heals on the next refresh. The proxy adapter already did
this correctly; this brings the two auth.py sites in line.
* feat(setup): Blank Slate setup mode — minimal agent, opt in to everything
Adds a third first-time setup option alongside Quick Setup and Full Setup.
Blank Slate forces ON only what an agent needs to run — provider & model,
the File Operations toolset, and the Terminal toolset — and turns
everything else OFF, then walks the user through opting each capability
back in.
What it does:
- platform_toolsets.cli = [file, terminal] (explicit, authoritative list)
- agent.disabled_toolsets = every other known toolset (web, browser,
code_execution, vision, memory, delegation, cronjob, skills, image_gen,
kanban, …). Applied last in the resolver, so it overrides the
non-configurable platform-toolset recovery that would otherwise re-add
toolsets like kanban — guaranteeing a true blank slate.
- Optional config features off: compression, memory + user-profile capture,
checkpoints, smart model routing, auto session reset.
- Bundled skills default to NONE (reuses the .no-bundled-skills marker);
offers to seed the full catalog.
- Walks through tools / plugins / MCP / messaging, all opt-in.
Proven end-to-end: with the Blank Slate config, model_tools.get_tool_definitions
emits exactly 6 schemas — patch, process, read_file, search_files, terminal,
write_file. Nothing else reaches the model.
Re-enable later via hermes tools / hermes skills opt-in --sync /
hermes setup agent.
Tests: tests/hermes_cli/test_setup_blank_slate.py (8 tests) pin the writers,
the resolver invariant ({file, terminal}), and the 6-schema end-to-end set.
Docs: getting-started/quickstart.md documents all three setup modes.
* feat(setup): Blank Slate fork — finish minimal, or walk through configs
After applying the minimal baseline (provider/model + file + terminal,
everything else off), Blank Slate now presents a choice instead of always
running the full walkthrough:
1. Start with everything disabled — finish now with the minimal agent.
2. Walk through all configurations — opt in to tools, skills, plugins, MCP,
and messaging.
Provider/model and terminal are still configured first either way (the agent
can't run without them). The finish-now path records the bundled-skill opt-out
so future `hermes update` runs don't re-inject skills. The walkthrough body
moved to a separate _blank_slate_walkthrough() helper.
Tests: TestBlankSlateFork covers both branches (finish-now applies baseline +
skill opt-out and skips the walkthrough; walkthrough path invokes it). Docs
updated to describe the fork.
Salvage of PR #41284 onto current main. Relocates the last 9 inline messaging
adapters (+ satellites: telegram_network, feishu_comment/_rules/meeting_invite,
wecom_crypto, wecom_callback) from gateway/platforms/ into self-contained
bundled plugins under plugins/platforms/<x>/, discovered via the platform
registry. Strips the per-platform core touchpoints from gateway/run.py,
gateway/config.py, hermes_cli/gateway.py, hermes_cli/setup.py, and
tools/send_message_tool.py.
Carries forward the migration fixes (explicit enabled:false honored,
get_connected_platforms forces discovery, plugin is_connected via
gateway.get_env_value, logs --component gateway matches plugins.platforms.*,
matrix hidden on Windows).
Additionally ports config keys main added since the PR base: the matrix
plugin's _apply_yaml_config now also covers allowed_users,
ignore_user_patterns, process_notices, and session_scope (the inline
gateway/config.py matrix block gained these in the 1340 commits the PR sat
open; they would otherwise have been silently dropped on deletion).
Widen the env_float() guard from #48735 across the whole bug class: a
non-numeric value (e.g. a stale .env "HERMES_API_TIMEOUT=abc" or a typo'd
port) raised an unhandled ValueError and crashed adapter/agent init.
Converts 22 genuinely-unguarded first-party int/float(os.getenv()) sites to
the canonical utils.env_int / utils.env_float helpers (the established house
pattern), instead of duplicating per-module helpers or inline try/except:
- gateway/config.py: WECOM_CALLBACK_PORT, BLUEBUBBLES_WEBHOOK_PORT
- gateway/platforms/email.py: EMAIL_IMAP/SMTP_PORT, EMAIL_POLL_INTERVAL
- gateway/platforms/feishu.py: dedup cache + text/media batch settings
- gateway/platforms/wecom.py, discord/adapter.py: text batch delays
- gateway/platforms/telegram.py: media batch delay, TELEGRAM_WEBHOOK_PORT
- gateway/platforms/whatsapp.py: WHATSAPP_NPM_INSTALL_TIMEOUT
- hermes_cli/auth.py: CODEX/XAI refresh timeouts
- agent/chat_completion_helpers.py: API/stream read/stale timeouts
- run_agent.py, agent/auxiliary_client.py: API + nous timeouts
Sites already guarded by try/except or local helpers are left untouched.
The HERMES_MAX_ITERATIONS sites are already guarded on main via
_current_max_iterations(), so they are not included.
The god-file Phase 4 refactor (094aa85c37) moved agent construction into
CLIAgentSetupMixin, which set the atexit shutdown reference with a bare
`global _active_agent_ref`. After extraction that global binds the *mixin
module's* namespace, not cli.py's. cli._run_cleanup reads
cli._active_agent_ref to decide whether to fire the memory provider's
on_session_end hook — and it stayed None for the whole session, so the
`if _active_agent_ref:` branch was dead and on_session_end never ran on
/exit. Custom memory providers silently lost end-of-session extraction.
Fix: publish the reference onto the cli module explicitly
(`import cli as _cli; _cli._active_agent_ref = self.agent`), using the
deferred-import pattern already established in the mixin.
Regression test asserts cli._active_agent_ref is populated by the mixin's
publish line and guards against a relapse to the bare `global` form. The
existing shutdown tests passed only because they hand-assigned the ref,
which is exactly what masked this.
Behavior-preserving cleanups on the managed-node resolver:
- Hoist _candidate_node_command_names() out of the inner dir loop in
find_hermes_node_executable (computed once, not per directory).
- Drop redundant os.environ.copy() at the two with_hermes_node_path(
os.environ.copy()) sites \u2014 the helper already copies os.environ when
called with no argument (verified env-equivalent).
- Add reciprocal keep-in-sync comments between iter_hermes_node_dirs()
(hermes_constants.py) and hermesManagedNodePathEntries() (electron
main.cjs), which mirror the same platform-ordering rule across the
Python/Node boundary.
The `hermes update` desktop-rebuild gate still used a bare
`shutil.which("npm")` presence check. On a Windows box where the only
working npm is the Hermes-managed npm.cmd (not on PATH), the gate would
skip the desktop rebuild even though _build_web_ui / cmd_gui can now find
it via find_node_executable. Route the gate through the same resolver for
full bug-class coverage.
Surfaced during review of #49239.
Second review pass (Codex + Hermes subagent). Codex reproduced a real race with
a two-thread harness; both converged on the remaining issues.
- Generation-aware publish (fixes a lost-update race): two refresh callers (the
late-refresh daemon and the between-turns prologue around turn 1) could each
compute a snapshot outside the lock; a SLOWER caller holding an OLDER registry
generation could acquire the publish lock after a newer caller and clobber it,
deleting just-landed tools. refresh_agent_mcp_tools now captures
registry._generation before computing and refuses to publish a stale set;
agent._tool_snapshot_generation tracks the published generation.
- Context-engine routing names (_context_engine_tool_names) are now staged on a
local and published atomically with the snapshot, and only claimed when this
rebuild actually appended the schema — matching agent_init's dedup so a
registry/plugin tool of the same name keeps its own dispatch. (Previously
mutated live, before the publish lock, and on no-change refreshes.)
- CLI /reload-mcp: self.enabled_toolsets is resolved once at startup, so a
server newly ENABLED in config mid-session wasn't picked up (TUI already
re-resolved). Merge now-connected MCP server names into the override (unless
the user pinned all/*), mirroring startup, and keep self.enabled_toolsets in
sync. Closes the CLI/TUI parity hole.
- ACP (acp_adapter/server.py) routed through the shared helper — it was a 5th
sibling rebuild that re-injected memory tools but NOT context-engine tools and
bypassed the atomic/name-diff path (inert today, fragile).
- mcp_startup._resolve_discovery_timeout pulls its default from DEFAULT_CONFIG
(single source of truth) instead of a stale hardcoded 5.0 literal.
- Tests: stale-generation-no-clobber, _skip_mcp_refresh honored, timeout
fallback uses DEFAULT_CONFIG.
Consolidated findings from three independent reviewers (Codex, Claude Code, a
Hermes subagent w/ the hermes-agent-dev skill):
- BLOCKING: refresh_agent_mcp_tools rebuilt only the registry subset, silently
dropping post-build-injected memory-provider (mem0/honcho/…) and context-
engine (lcm_*) tools on every refresh. Now additive-preserving: re-applies
the same injectors agent_init uses, staged on locals and published atomically.
- Re-injection now honors the #5544 enabled_toolsets gate for context-engine
tools, so a restricted-toolset platform can't get lcm_* leaked back in.
- Atomic read-diff-publish under one lock: the returned `added` set and the
(tools, valid_tool_names) pair are consistent even under concurrent callers
(no half-swap, no TOCTOU).
- background_review fork opts out (_skip_mcp_refresh) so its byte-identical
tools[] cache parity with the parent is preserved.
- CLI /reload-mcp routed through the shared helper (was a 4th divergent copy
with the same clobber bug + missing disabled_toolsets).
- Explicit reloads (TUI RPC + CLI) pass enabled_override so a server the user
just enabled in config this session is picked up; automatic paths reuse the
agent's build-time selection.
- mcp_discovery_timeout default 5.0 -> 1.5s: correctness now comes from the
between-turns refresh, so the startup wait is only a small turn-1 UX bump
rather than a heavy dead-server latency penalty.
- has_registered_mcp_tools checks registered TOOLS (not connected servers) so a
zero-tool/prompt-only server doesn't make the per-turn hook fire forever.
- Tests: rewrote the thread-safety test to actually exercise the write path
(alternating tool sets), added the #5544-gate regression, the memory/context
preservation regression, and a "callable next turn via valid_tool_names"
contract; removed a dead monkeypatch line.
MCP servers that connect after the agent's one-time tool snapshot were
invisible for the whole session. Two root causes, fixed together:
1. The startup discovery wait was a flat 0.75s. HTTP/OAuth servers
commonly take 2-6s on a cold connect, so they missed the window and
their tools never entered the agent's snapshot. `thread.join(timeout)`
already returns the instant discovery completes, so raising the bound
costs ~0s for the common case (no MCP / fast servers) and only ever
blocks for a genuinely-pending server, capped so a dead server can't
freeze startup. The bound is now configurable via
`mcp_discovery_timeout` (config.yaml, default 5.0s).
2. Three call sites duplicated the agent tool-snapshot rebuild (the TUI
`reload.mcp` RPC, the gateway reload, and the TUI late-binding refresh
thread), and the late-refresh detected changes by tool COUNT — missing
an equal-size add/remove swap. Consolidated into one shared
`tools.mcp_tool.refresh_agent_mcp_tools(agent)` helper that diffs by
tool NAME, mutates the agent under a lock (thread-safe), and respects
the agent's own enabled/disabled toolsets.
The late-binding refresh keeps its pre-first-turn cache-safety guard:
it never rebuilds the tool list once a turn has started, so the cached
prompt prefix is never invalidated mid-conversation.
Tests: new tests/tools/test_refresh_agent_mcp_tools.py covers the
name-based diff, in-place mutation, agent-scoped filtering, thread
safety, and the config-driven discovery bound (incl. instant-return
when nothing is pending). 75 passed across the touched areas.
Manual verification surfaced a second bypass class beyond the standalone
config loaders: several code paths bridge config.yaml values into os.environ
(HERMES_TIMEZONE, HERMES_REDACT_SECRETS, HERMES_MAX_ITERATIONS, TERMINAL_*,
network.force_ipv4, ...) by reading the raw user YAML, so the env the whole
process reads carried the USER's value even when an administrator pinned it —
e.g. a managed timezone was overridden because gateway/run.py wrote the user's
timezone into HERMES_TIMEZONE, and _resolve_timezone_name() checks the env var
first.
Wired the shared apply_managed_overlay() into every config→env bridge:
- gateway/run.py module-level startup bridge (timezone, redact_secrets,
max_turns, terminal, display, gateway.strict, ...)
- gateway/run.py _reload_runtime_env_preserving_config_authority (the per-turn
re-bridge that keeps config authoritative over reloaded .env — must keep
MANAGED authoritative on every turn, not just startup)
- hermes_cli/main.py early security.redact_secrets / network.force_ipv4 bridge
(runs before load_config is usable, at import time)
- hermes_cli/send_cmd.py top-level scalar config→env bridge
Verified end-to-end against a writable managed dir (12/12 checks incl. timezone,
logging, model, skin, gateway settings, write-guard) and in a clean process the
gateway per-turn bridge writes HERMES_TIMEZONE=<managed>. Adds an
order-independent regression test for the bridge overlay.
The skin bug was one instance of a class: several subsystems build their
config dict directly from config.yaml instead of routing through
hermes_cli.config.load_config (which carries the managed merge), so they
silently ignored administrator-pinned values. Audited every config.yaml
reader and fixed the behavioral-read bypasses:
- gateway/config.py load_gateway_config (messaging gateway: session_reset,
quick_commands, stt, model, ...)
- gateway/run.py _load_gateway_config (its read_raw_config fast path also
skipped the merge — read_raw_config returns raw user YAML)
- tui_gateway/server.py _load_cfg (new TUI + desktop backend: skin,
reasoning_effort, service_tier, provider_routing)
- cron/scheduler.py (scheduled-job model/reasoning/toolsets/provider_routing)
- hermes_logging.py (logging.level/max_size_mb/backup_count)
- hermes_time.py (timezone)
- hermes_cli/doctor.py (memory-provider diagnostic reads effective config)
All route through a new shared managed_scope.apply_managed_overlay() helper
that mirrors _load_config_impl (env-only expansion so a user ${VAR} can't
shadow a managed literal, root-model-string normalization, leaf-merge) and is
fail-open. cli.py's earlier inline fix is refactored onto the same helper.
Write-back paths (slash_commands, telegram/yuanbao dm_topics, profile
distribution) are deliberately left reading raw user YAML — overlaying managed
values there would persist them into the user file. The dashboard
(web_server.py) already routes through load_config and needed no change.
TUI loader caches the RAW config so _save_cfg never writes managed values to
disk. Adds test_managed_scope_overlay.py (helper) and
test_managed_scope_loaders.py (per-surface integration); mutation-checked.
- show_config prints an administrator header naming the managed source and
lists the pinned config/env keys when a scope is active (silent otherwise).
- hermes doctor gains a managed_scope_check under Configuration Files that
reports the resolved managed dir + pinned key counts, and flags a
HERMES_MANAGED_DIR redirect (the documented foot-gun).
- set_config_value hard-rejects a managed config key (D2) and names the
source, exiting non-zero.
- save_env_value / remove_env_value refuse a managed env key.
- save_config strips managed leaves from a bulk write (mechanical safety net)
with a warning, so the unmanaged remainder still persists.
New _strip_dotted_keys helper drives the bulk-save pruning. All guards are
distinct from and layered after the existing is_managed() package-manager
write-lock.
load_hermes_dotenv now loads the managed-scope .env after user/project .env
and external secret sources, with override=True, so managed env values beat
the user .env and any pre-existing shell export. Reuses the existing dotenv
fallback + credential-sanitization path. Fail-open: no managed dir/.env is a
no-op and any error is swallowed so managed scope never blocks startup.
_load_config_impl now deep-merges the managed config.yaml on top of the
expanded user config so managed leaves win while sibling keys stay
user-controlled (leaf-level merge, D3). Managed values are expanded against
the process env only, never user-defined ${VAR}, so a user can't shadow a
managed literal. The managed file's (mtime,size) is folded into the load
cache key so editing it invalidates the cache. This inverts the usual
env-over-config precedence for pinned keys by design (see design doc §4.1).
New hermes_cli/managed_scope.py resolves a system-level managed directory
(HERMES_MANAGED_DIR override > /etc/hermes), parses managed config.yaml/.env
with fail-open semantics, and exposes is_key_managed/is_env_managed helpers.
The system default is ignored under pytest and HERMES_MANAGED_DIR is added to
the conftest env scrub so a real managed scope can't leak into the suite.
Not wired into the load paths yet (Phases 2-3).
_terminate_reclaimed_worker early-returned on ProcessLookupError with
terminated=False. The new reclaim-defer guard reads that as 'worker
survived the kill' and defers the reclaim forever, so a stale task whose
worker is already dead never lands in result.stale. ProcessLookupError
means the process is gone — that IS a successful termination. Split it
from the generic OSError branch and set terminated=True.
release_stale_claims and detect_stale_running call _terminate_reclaimed_worker
and then release the task claim unconditionally, even when the termination did
not actually kill the worker. _terminate_reclaimed_worker already reports this
via its "terminated" flag, but the callers ignore it.
When a worker is parked in uninterruptible (D) state — for example throttled by
a cgroup memory.high limit — a pending SIGTERM/SIGKILL cannot be delivered until
the throttle lifts, so the kill is a no-op. The dispatcher then frees the claim
and spawns a fresh worker beside the still-alive one. Repeated every dispatch
tick this accumulates duplicate workers without bound, deepening the memory
pressure that caused the throttle in the first place — a self-reinforcing
runaway.
Fix: gate both automatic reclaim paths on _worker_survived_termination(). When
we attempted to kill our own host-local worker and it is still alive, defer the
reclaim (_defer_reclaim_for_live_worker extends the claim a short grace and
emits a reclaim_deferred event) instead of releasing. This guarantees at most
one live worker per task and is self-correcting: not spawning a duplicate is
what relieves the pressure so the pending signal lands and the worker dies, and
the next tick reclaims cleanly. Non-host-local claims and the operator-driven
reclaim_task() path keep their existing force-release behaviour.
Related: #41448 (concurrent dispatchers amplify this by doubling reclaim
frequency); #42858 (kill the worker rather than orphan it on archive).
Tests: defer-when-worker-survives, reclaim-when-killed,
release-when-not-host-local, and the detect_stale_running path.
PR #49056 set the default to 0, which reverts the #45592 idle-clock fix:
without a periodic invalidate, prompt_toolkit stops repainting the bottom
chrome during idle and the status bar goes stale/disappears after a turn.
Restore 1.0 as the default for everyone. The config knob stays — users on
emulators where the per-second redraw fights auto-scroll (#48309) can set
display.cli_refresh_interval: 0 to opt out.
- _guard_named_profile_under_multiplexer: when the default gateway is running
with gateway.multiplex_profiles=on, a named-profile 'hermes gateway run' hard
-errors (pointing at the multiplexer) instead of double-binding that
profile's platforms. Inert unless all hold: this invocation is a named
profile, a default-profile gateway is alive, and its config has multiplexing
on. --force overrides. Wired into run_gateway's guard chain.
- write_runtime_status gains served_profiles: the secondary-adapter startup
records [active] + multiplexed profiles into runtime_status.json so
'hermes status' can show per-profile coverage without a second probe. Absent
for single-profile gateways.
Tests: served_profiles round-trips and is absent by default; guard is inert for
the default profile / under --force / when no default gateway is running.
The credential gate. When multiplexing is active, a profile's secrets resolve
from a context-local scope, never the process-global os.environ (which in a
multiplexer may hold another profile's keys, and is inherited by every
subprocess spawned with env=dict(os.environ)).
- agent/secret_scope.py: get_secret() backed by a secret-scope contextvar.
FAIL-CLOSED: when multiplex is active and no scope is installed, an unscoped
read RAISES UnscopedSecretError instead of falling back to os.environ — a
missed/new call site crashes loudly at that line rather than leaking a
cross-profile value. Genuinely-global vars (HERMES_*, PATH, kanban paths,
…) keep reading os.environ via an allowlist. load_env_file/build_profile_
secret_scope parse a profile .env into an isolated dict WITHOUT mutating
os.environ. Off by default => transparent os.getenv behavior.
- hermes_cli/runtime_provider.py: all credential/provider/base-url reads go
through _getenv -> get_secret.
- agent/credential_pool.py: env fallbacks route through get_secret (the
~/.hermes/.env-first preference is preserved and already profile-correct via
the home override).
- tools/mcp_tool.py: MCP config interpolation resolves through
get_secret, so a server's picks up the routed profile's value.
- gateway/run.py: set_multiplex_active() at GatewayRunner init; per-turn .env
reload is a no-op for credentials in multiplex mode (secrets come from the
scope, not global env); _profile_runtime_scope context manager combines the
HERMES_HOME override + secret scope; _run_agent wraps _run_agent_inner in
that scope (resolved via _resolve_profile_home_for_source) when multiplexing.
Propagates into the agent worker thread for free via the existing
copy_context() in _run_in_executor_with_context.
Tests: 13 unit (fail-closed, scope isolation, global allowlist, .env parsing
without environ mutation) + 7 E2E (runtime_provider + MCP interpolation prove
two profiles isolated, unscoped read raises, globals still read environ).
Foundations for serving multiple profiles from one gateway process, inert
when off:
- gateway.multiplex_profiles config flag (default false), round-trips through
GatewayConfig and load_gateway_config (top-level + nested gateway.* form).
- hermes_cli.profiles.profiles_to_serve(multiplex): the single chokepoint for
which (profile, HERMES_HOME) pairs the gateway serves. Lightweight dir scan;
active-profile-only when off, default + all named profiles when on.
- build_session_key gains a profile= namespace slot. Default/None reuse the
historical 'agent:main:...' literal BYTE-IDENTICALLY (no session migration,
positional parsers unaffected); a named profile becomes 'agent:<profile>:...'
so two profiles on the same platform/chat never collide.
- SessionStore._resolve_profile_for_key + _session_key_for_source fallback
resolve the namespace from the flag (legacy when off, active profile when on).
Tests: byte-identical-when-off (parametrized), namespace isolation, positional
layout preserved, config round-trip, profiles_to_serve enumeration.