Route curator rollback through the same cross-process cron job lock, make save_jobs lock for legacy direct callers without deadlocking nested mutation paths, and harden the regression test so a second _jobs_lock caller really blocks across processes.
`hermes cron pause`/`resume`/`remove` run in their own CLI process (CLI →
cronjob tool → pause_job → update_job → save_jobs), entirely separate from
the gateway process that also writes jobs.json (mark_job_run, advance_next_run,
due-fast-forward in get_due_jobs). The only synchronization was a module-level
`threading.Lock`, which serializes writers *within a single process* but does
nothing across processes — and update_job/pause_job/remove_job/create_job did
not even take it.
The result is a classic lost update: a `cron pause` issued while the gateway is
live loads jobs.json, sets enabled=False, and saves; concurrently the gateway
loads the same file and saves back its run-bookkeeping, clobbering the pause.
The CLI prints "Paused" (it succeeded against its own in-memory copy) but the
job stays enabled and keeps firing, with no error surfaced. The scheduler's
`.tick.lock` flock can't be reused for this — it is held for the entire tick,
including multi-minute agent runs, so a CLI mutation would block for minutes.
Add `_jobs_lock()`: a short-held cross-process advisory file lock (fcntl/msvcrt
flock on `<hermes_home>/cron/.jobs.lock`) layered over the existing in-process
lock, and wrap every load→modify→save critical section with it — create_job,
update_job, remove_job, mark_job_run, advance_next_run, get_due_jobs,
rewrite_skill_refs. The lock degrades to in-process-only if neither fcntl nor
msvcrt is available, preserving prior behaviour. All critical sections are short
(field edits, no agent execution), so contention resolves in milliseconds.
Adds a regression test that proves the lock excludes a second process (an
in-process threading.Lock cannot).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Upgrade the Vite/esbuild surfaces that kept web, ui-tui, and the bootstrap installer on vulnerable esbuild versions, regenerate the root lockfile, and preserve intentional package+lock dependency edits during update lockfile cleanup.
Add a parser-only routing regression that proves raw WhatsApp group JIDs bypass channel-directory resolution and home-channel fallback, include channel_aliases.json in quick state snapshots, harden malformed alias handling, and map Keiron McCammon for release attribution.
send_message(target="whatsapp:<group-jid>") silently delivered to the
configured home DM instead of the requested group. Two gaps:
1. _parse_target_ref had no WhatsApp branch. Group JIDs (<id>@g.us),
user JIDs (<id>@s.whatsapp.net), linked-identity JIDs (<id>@lid), and
broadcast/newsletter JIDs matched no pattern and fell through to
`return None, None, False`, so the caller treated them as
unresolvable and used the home channel. The bridge's /send endpoint
accepts any chatId, so only the tool-side target parsing was at fault.
Add a whatsapp branch that recognizes native JIDs as explicit targets.
The pre-existing '+'-prefixed E.164 path is preserved.
2. WhatsApp groups have no human-friendly name — the channel directory
is regenerated from session data on a timer, so a group shows up as
its raw 18-digit JID and any hand-edit to channel_directory.json is
clobbered on the next rebuild. Add a user-maintained alias overlay
(~/.hermes/channel_aliases.json) re-applied on every build AND every
load, giving durable friendly names and letting a freshly-created
group be pre-named before its first message.
Tests: TestParseTargetRefWhatsAppJID (7 cases) for the parser;
TestChannelAliases (7 cases) for the overlay, plus an autouse fixture
isolating CHANNEL_ALIASES_PATH so a real alias file can't leak into the
existing directory tests.
Keep request dump writes on the shared atomic JSON path, add regression coverage for request body/error/stdout redaction, and map the salvaged contributor email for release attribution.
dump_api_request_debug() masks the provider Authorization header but writes
the request `body` (system prompt, tool defs, context-embedded values) and the
error message raw via atomic_json_write. This path also fires unconditionally
on API errors (not only under HERMES_DUMP_REQUESTS), so any secret surfaced
into context (e.g. an integration token) lands in cleartext at
request_dump_*.json on every failed call.
Run the serialized dump through the existing redact_sensitive_text() scrubber
(already used for logs/tool output) before persisting and before the
HERMES_DUMP_REQUEST_STDOUT print; preserve atomicity via temp-file +
Path.replace. Also add the Notion internal-integration prefix (ntn_) to
_PREFIX_PATTERNS so bare values are caught.
Per SECURITY.md §3.2 this is a redaction (in-process heuristic) hardening, not
a §3.1 vulnerability. Refs #46583.
converse() and converse_stream() were added in boto3 1.34.59. When Hermes
is installed editable into system Python (e.g. Ubuntu 24.04 ships 1.34.46),
the system boto3 takes precedence and calls to converse_stream fail with
AttributeError. Add an early version check in _require_boto3() that raises
a clear RuntimeError with upgrade instructions.
On Linux, systemd spawns core services (cron, nginx, sshd) with
deterministic PIDs and jiffy start_times across reboots. A service can
land on the exact same PID and start_time as a previous gateway, causing
acquire_scoped_lock to mistake it for a live gateway and block startup.
The existing stale-detection paths only covered:
- start_times both non-None and different (clear mismatch)
- start_times both None (macOS/Windows fallback to cmdline check)
The boot-time collision falls through both: times are non-None and
equal, so neither branch fired.
Add a third check: when both start_times are known and match but the
live process fails _looks_like_gateway_process, read its cmdline. If
the cmdline is readable (non-None), we have positive evidence of an
impostor and mark the lock stale. Requiring a readable cmdline keeps the
check conservative — if cmdline is unreadable we do not evict.
Adds an observation_scopes config key (and HINDSIGHT_RETAIN_OBSERVATION_SCOPES
env var) so retained memories can opt into per_tag / all_combinations /
custom scoping instead of Hindsight's default combined pass.
Threaded through _build_retain_kwargs so all three retain paths honor it:
auto-retain and flush-on-switch already use aretain_batch; the tool retain
path is switched from aretain to aretain_batch (functionally equivalent,
aretain just wraps a single-item batch) since aretain doesn't accept the
observation_scopes parameter.
Salvaged commit in this PR is authored by capt-marbles
(andrewdmwalker@gmail.com), a bare gmail that does not auto-resolve in
the check-attribution job. Add the AUTHOR_MAP entry.
The salvaged read-side fix lets a profile resolve the xAI OAuth grant from
the global-root auth store when it has no own providers.xai-oauth block.
But _save_xai_oauth_tokens still wrote rotated tokens only to the active
profile store. Because xAI rotates the refresh_token on every refresh, a
profile that reads root's grant and refreshes it left root holding a now-
revoked refresh token — killing every other profile reading the stale root
grant with invalid_grant once its access token expired (#43589).
Detect the read-from-root case (profile lacks its own providers.xai-oauth
block) and, after the profile save, write the rotated chain back to the
global root too via a best-effort, TOCTOU-safe write-through that reuses
_save_auth_store with an explicit target path. A profile that genuinely
shadows root (has its own block) is left untouched, classic mode is a
no-op, and a failed root write never breaks the profile's own save.
Pairs with the read fallback in the preceding commit so the cross-profile
xAI grant stays coherent in both directions.
* fix(docker): skip per-profile gateway reconciliation in dashboard container
When gateway and dashboard containers share a bind-mounted HERMES_HOME,
both run the cont-init.d profile reconciliation script, which creates
s6-log processes for every persisted profile. These s6-log processes
in different containers race to flock() the same log-directory lock
files under logs/gateways/<profile>/lock, producing repeated
"s6-log: fatal: unable to lock ... Resource busy" errors and a
supervision restart storm.
Add HERMES_SKIP_PROFILE_RECONCILE env var support to container_boot.py
and set it in the official docker-compose.yml dashboard service so the
dashboard container no longer creates per-profile gateway s6 services
it never uses.
* chore(release): map salvaged contributor
* refactor(docker): autodetect dashboard container instead of env-var gate
Replace the HERMES_SKIP_PROFILE_RECONCILE env var with PID 1 argv role
detection. A dashboard-only container never spawns or supervises
per-profile gateways, so the reconcile boot hook now skips itself when
/proc/1/cmdline is the dashboard command — no operator flag to set (or
forget in a hand-written manifest, which would reintroduce the s6-log
flock storm this prevents).
- Extract _strip_container_argv_prefix() shared by the legacy-gateway
and new dashboard detectors (DRY the init/wrapper/hermes peel).
- Add _is_dashboard_container(); gate reconcile main() on it.
- Drop HERMES_SKIP_PROFILE_RECONCILE from code + docker-compose.yml.
- Tests: argv matrix for both roles + main()-level skip/reconcile proof
and a regression that the removed env var is now inert.
Co-authored-by: 895252509 <895252509@qq.com>
---------
Co-authored-by: zhouxiang <895252509@qq.com>
Co-authored-by: Ben <ben@nousresearch.com>
_configured_terminal_cwd and _registered_task_cwd_override carried a
byte-identical sentinel + expanduser + isabs validation tail. Extract it
into _sentinel_free_abs_cwd(raw) so the relative/sentinel rejection rule
lives in one place. Behaviour unchanged (the str() coercion the override
path relied on is preserved in the helper).
The session-cwd fix inserted a registered task/session cwd override step
between the live-cwd and $TERMINAL_CWD fallbacks, but three docstrings still
described the old two-step order — _resolve_base_dir's numbered list was
outright wrong. Update _authoritative_workspace_root, _resolve_base_dir, and
_path_resolution_warning to reflect the actual four-step resolution order.
No behaviour change.
The raw-key-first-then-collapsed override lookup was hand-rolled in three
places with subtly different spellings: terminal_tool's command setup, and
both file_tools._registered_task_cwd_override and _get_file_ops. Since that
exact raw-vs-collapsed invariant is what the session-cwd fix depends on,
keeping three copies invites the drift that caused the original bug.
Add terminal_tool.resolve_task_overrides(task_id) as the single source and
route all three sites through it. Behaviour is unchanged (verified
byte-equivalent across raw/collapsed/isolation/None/subagent inputs).
The supervised `gateway-default` s6 slot runs bare `hermes gateway run`
(no -p) to mean "the root HERMES_HOME profile". But `_apply_profile_override`
falls through its #22502 HERMES_HOME guard for the container root
(/opt/data, whose parent is not `profiles`) and reads the sticky
`active_profile` file. If the user set another profile active (e.g. via
the dashboard), the reserved default gateway gets redirected into that
profile — producing a duplicate gateway for the active profile and no
real default gateway. The profile page and `gateway status` then
correctly report default as "not running" because there genuinely isn't
one.
Guard step 2 (the sticky active_profile fallback) with the existing
HERMES_S6_SUPERVISED_CHILD sentinel that the container run-script already
exports. Supervised named-profile slots pass -p explicitly (step 1, never
reaches step 2); only the bare default slot was affected. Inert outside
the s6 container — the sentinel is never set elsewhere.
Reported in the 'Docker & Profiles & Dashboard' support thread.
The unified machine-dashboard reroute (cmd_dashboard) re-execs a named-profile
dashboard launch as the machine dashboard and dropped HERMES_HOME from the
child env with the comment "so the child binds the machine root". That holds
for a standard install (root == ~/.hermes) but breaks the Docker layout: the
published image sets `ENV HERMES_HOME=/opt/data`, so once HERMES_HOME is unset
the child falls back to $HOME/.hermes = /opt/data/.hermes — an empty,
auto-seeded home.
Two user-visible symptoms, one root cause (reported via support):
1. Dashboard Profiles page shows only an empty `default` — the real
default/oracle/saga profiles live under /opt/data/profiles, but the
rerouted child resolves _get_profiles_root() to /opt/data/.hermes/profiles.
2. The "Update Hermes" button runs `hermes update` inside the container
repeatedly instead of bailing with the docker-update guidance. The Docker
guard keys off detect_install_method(), which reads
$HERMES_HOME/.install_method; the image stamps that at /opt/data, but the
misresolved home has no stamp, no HERMES_MANAGED, and no .git → falls
through to "pip", so the guard never fires.
The reporter's workaround was to bind-mount the host dir at both /opt/data and
/opt/data/.hermes so the two paths converge (at the cost of a self-referential
recursion).
Fix: resolve the machine root explicitly with get_default_hermes_root() and set
it on the child env instead of popping HERMES_HOME. That helper returns the
root for both layouts — ~/.hermes for a standard install, and /opt/data for
Docker (it strips a trailing profiles/<name>). Falls back to the old pop
behaviour only if root resolution raises, so the reroute is never blocked.
Regression tests in test_dashboard_unified_launch.py: the existing standard-
install test now asserts the child carries HERMES_HOME == get_default_hermes_root()
(not absent), and a new test_reexec_pins_docker_machine_root covers the Docker
layout (HERMES_HOME=/opt/data/profiles/oracle → child gets /opt/data). Both
fail against the pre-fix pop behaviour (mutation-verified).
* fix: persist s6 gateway desired state
* chore(release): map salvaged contributor
---------
Co-authored-by: Alfred Smith <alfred@my-cloud.me>
Co-authored-by: Ben <ben@nousresearch.com>
* fix(gateway): chown logs/gateways parent so late-added profiles can log
The per-profile log service script created $HERMES_HOME/logs/gateways/
via 'mkdir -p' but only chowned the leaf logs/gateways/<profile>. When
the first log service boots in root context, the gateways/ parent stays
root:root; every profile registered later runs its log service as the
dropped hermes user, 'mkdir -p' fails with EACCES, and s6-log enters a
sub-second fatal crash-loop flooding the container log. The stage2
recursive heal does not catch it either: it is gated on needs_chown,
which is false when the top-level $HERMES_HOME is already hermes-owned.
Two complementary fixes:
- service_manager._render_log_run: chown the gateways/ parent
(non-recursively) before the leaf chown. Runs on every root-context
boot, so it also heals volumes already poisoned by older images.
- docker/stage2-hook.sh: seed logs/gateways in the as_hermes mkdir -p
block; cont-init runs before any service starts, so the parent
already exists hermes-owned when the first log/run does 'mkdir -p'.
The needs_chown repair loop needs no twin entry: it already chowns
logs/ recursively, which covers logs/gateways.
Fixes#45258
* chore(release): map salvaged contributor
---------
Co-authored-by: tangtaizhong666 <tangtaizhong792@gmail.com>
On Windows, native Python extensions such as _bcrypt.pyd are loaded as
DLLs by any running hermes process. When the installer tries to recreate
the venv (Remove-Item -Recurse -Force "venv"), Windows denies the delete
because the DLL is still mapped into the running process.
Add a taskkill /F /T /IM hermes.exe call before the Remove-Item so any
hermes process tree is stopped first, releasing the file lock. A short
sleep gives the OS time to unload the image before deletion proceeds.
This mirrors the existing force_kill_other_hermes() guard already present
in the --update flow (update.rs), applying the same pattern to the full
reinstall/repair path through install.ps1.
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
The collapsed-pane hover-reveal trigger strip (14px wide, 6px edge
gutter) overlapped the neighboring scroller's 8px .scrollbar-dt
scrollbar, which sits flush with the window edge when the rail panes
are collapsed. Hovering the scrollbar revealed the file browser over
it, and clicks on the overlapped band hit the trigger instead of the
scrollbar thumb.
Widen the edge gutter to calc(0.5rem + 2px) so the strip clears the
scrollbar (rem-coupled to the .scrollbar-dt width) while still
covering the OS window-resize grab area inset.
Part of #44140 (item 2).
Co-authored-by: AIalliAI <285906080+AIalliAI@users.noreply.github.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
* fix(s6): prevent profile create from auto-starting gateway service
When hermes profile create runs inside an s6 container,
_maybe_register_gateway_service() calls register_profile_gateway()
which creates the service directory and triggers s6-svscanctl -a.
Previously the service always started immediately, causing profiles
that share the main gateway's bot token (e.g. Kanban worker profiles)
to fail with a token-lock conflict and persist gateway_state: running
— becoming zombies that resurrect on every container restart.
Wire the existing start_now parameter through the S6 implementation:
when start_now=False, write a marker file (same pattern as
container_boot.py _register_gateway_slot) so s6-supervise leaves the
service stopped until the user explicitly runs hermes -p <profile>
gateway start.
4 files, +61/-6, 4 new tests (all passing).
* test(docker): wait for gateway running state before restart
---------
Co-authored-by: liuhao1024 <sunsky.lau@gmail.com>
Remove the free Parallel Search MCP path and restore the keyed Parallel backend behavior from before it was introduced.
Also drops the keyless fallback registration/display labeling tests and returns the Parallel SDK pin to the prior version.
GLM-5.2 ships with a 1M (1,048,576) token context window. Without this
entry, Hermes falls through to the generic 'glm' key (202,752 tokens),
under-reporting the context bar and prematurely compressing conversations.
The 1M limit was verified empirically via needle-in-a-haystack retrieval
at 789,240 prompt tokens on api.z.ai/api/coding/paas/v4 — zero errors,
zero truncation, correct retrieval at every tested size (25K through 789K).
Changes:
- agent/model_metadata.py: add 'glm-5.2': 1_048_576 before 'glm' fallback
- hermes_cli/models.py: add glm-5.2 to zai curated models
- hermes_cli/setup.py: add glm-5.2 to setup wizard zai list
- hermes_cli/auth.py: add glm-5.2 to coding plan endpoint probes
- plugins/model-providers/zai/__init__.py: add glm-5.2 to fallback_models
- tests/agent/test_model_metadata.py: context resolution + vendor-prefix tests
The #45966 cross-process coherence guard snapshots a session's on-disk
message_count next to the cached agent and rebuilds the agent when the
count changes. But the snapshot is taken at agent-BUILD time — before
the turn writes its own user + assistant (+ tool) rows — and the cache
entry is never rewritten on a reuse. So this process's OWN turn grows
message_count, and the very next turn sees a mismatch and rebuilds the
agent. That happens every turn, for every conversation, silently
destroying the per-conversation prompt caching the cache exists to
protect (AGENTS.md: prompt caching is sacred).
Add _refresh_agent_cache_message_count(): after a turn completes and the
agent has flushed its rows to the SessionDB, re-baseline the stored count
to the now-current value. The guard then fires ONLY when a DIFFERENT
process changes the transcript — preserving the #45966 fix while keeping
the cache warm for normal single-process operation.
Tests drive the real SessionDB + the real guard condition: 5 consecutive
same-process turns now all REUSE the cached agent (0 before the fix); a
cross-process append still invalidates; and the re-baseline is fail-safe
(no DB, falsy session_id, raising probe, legacy 2-tuple, pending sentinel
all no-op).
The platform-disabled fix landed only in agent.skill_utils.get_disabled_skill_names
(the system-prompt path). Two sibling resolvers still used the old
replace-not-union semantics, so the same skill could be hidden from the
<available_skills> prompt yet reported enabled elsewhere:
- hermes_cli/skills_config.get_disabled_skills (the 'hermes skills config' UI)
returned only the platform list, so a globally-disabled skill showed as
enabled (unchecked) on any platform with a platform_disabled entry.
- tools/skills_tool._is_skill_disabled (gates whether skill_view loads a skill)
ignored the global list when a platform list existed, so a globally-disabled
skill could still be loaded on such a platform.
Both now union the global list with the platform list, matching
get_disabled_skill_names. An explicit empty platform list no longer re-enables
a globally-disabled skill — global disables hold on every platform (#46201).
Also: fix the now-stale get_disabled_skill_names docstring and drop a stray
blank line. Regression tests added for both sites (proven to fail on the old
replace semantics).
build_skills_system_prompt() already resolved _platform_hint but called
get_disabled_skill_names() with no argument, so the resolved platform never
reached the filter and the prompt cache_key varied by platform while the
disabled set did not. Pass _platform_hint or None.
get_disabled_skill_names() also fully ignored the global 'disabled' list once
a platform-specific list was found. Return the union (global | platform) so a
globally-disabled skill stays disabled on every platform.
Salvaged from #46203 by @iborazzi; the unrelated apps/shared/tsconfig.json
ES2023 bump is intentionally dropped (one concern per PR).