Matches the env= callsite convention at the other sanitized
subprocess spawns (cua_backend dict(os.environ), gateway
os.environ.copy()). Functionally equivalent — _sanitize_subprocess_env
never mutates its input — but avoids handing the live mapping to the
helper.
Follow-up to salvaged PR #49207.
Cron no_agent and pre-check scripts ran with the full gateway/agent
environment, allowing scripts under HERMES_HOME/scripts/ to read provider
credentials. Apply _sanitize_subprocess_env like terminal and MCP paths
(SECURITY.md section 2.3).
Add regression test asserting blocklisted provider vars are absent in the
child process.
The skin bug was one instance of a class: several subsystems build their
config dict directly from config.yaml instead of routing through
hermes_cli.config.load_config (which carries the managed merge), so they
silently ignored administrator-pinned values. Audited every config.yaml
reader and fixed the behavioral-read bypasses:
- gateway/config.py load_gateway_config (messaging gateway: session_reset,
quick_commands, stt, model, ...)
- gateway/run.py _load_gateway_config (its read_raw_config fast path also
skipped the merge — read_raw_config returns raw user YAML)
- tui_gateway/server.py _load_cfg (new TUI + desktop backend: skin,
reasoning_effort, service_tier, provider_routing)
- cron/scheduler.py (scheduled-job model/reasoning/toolsets/provider_routing)
- hermes_logging.py (logging.level/max_size_mb/backup_count)
- hermes_time.py (timezone)
- hermes_cli/doctor.py (memory-provider diagnostic reads effective config)
All route through a new shared managed_scope.apply_managed_overlay() helper
that mirrors _load_config_impl (env-only expansion so a user ${VAR} can't
shadow a managed literal, root-model-string normalization, leaf-merge) and is
fail-open. cli.py's earlier inline fix is refactored onto the same helper.
Write-back paths (slash_commands, telegram/yuanbao dm_topics, profile
distribution) are deliberately left reading raw user YAML — overlaying managed
values there would persist them into the user file. The dashboard
(web_server.py) already routes through load_config and needed no change.
TUI loader caches the RAW config so _save_cfg never writes managed values to
disk. Adds test_managed_scope_overlay.py (helper) and
test_managed_scope_loaders.py (per-surface integration); mutation-checked.
Two small, focused fixes for the cron scheduler and checkpoint manager.
1. _summarize_cron_failure_for_delivery (cron/scheduler.py):
Replaces the raw error dump in _process_job with a compact
pattern-matched summary. Provider rate limits, timeouts, and
authentication errors now produce a short human-readable message
instead of dumping multi-KB provider JSON into the delivery channel.
2. _repair_bare_repo_dirs (tools/checkpoint_manager.py):
Recreates refs/heads/ and branches/ directories after git gc
--prune=now, which can remove empty dirs from bare repos and cause
subsequent git add -A to fail with 'fatal: not a git repository'.
Called after all four git gc call sites.
Both fixes use only standard library imports and plug into existing
call sites with no architectural changes.
Phase 4F (F.1 + F.2 + F.3, agent side). F.4 is the operator-run live smoke
(needs a NAS deployment); recorded in the PR, not code.
F.1 — on_jobs_changed wiring:
- cron/scheduler.py: _notify_provider_jobs_changed() — resolve the active
provider, call on_jobs_changed(), swallow errors. Lives in scheduler.py (not
jobs.py) so the store stays free of provider imports (no import cycle).
- Wired at the consumer surfaces AFTER a successful mutation: the cronjob model
tool (tools/cronjob_tools.py, create/update/remove/pause/resume) — which the
`hermes cron` CLI also routes through — and the REST handlers
(gateway/platforms/api_server.py, same five). Built-in's no-op default = zero
behavior change on the default path. Sleeping-agent direct jobs.json writes
(no tool/CLI/REST) are covered by reconcile-on-wake in start().
F.2 — config: cron.chronos.{portal_url,callback_url,expected_audience,
nas_jwks_url}. All non-secret; the agent holds no scheduler creds and the
outbound provision call reuses the existing Nous token (no token key). Additive
deep-merge key, no version literal.
F.3 — docs:
- docs/chronos-managed-cron-contract.md: authoritative agent↔NAS wire contract
(the three agent-cron endpoints + inbound /api/cron/fire + the 3-hop trust
model + at-most-once/re-arm semantics). This is what the NAS-side agent builds
against.
- cron-internals.md: "Managed cron (Chronos) for scale-to-zero" section.
- cli-commands.md: cron.provider accepts chronos + the cron.chronos.* keys.
- User docs name no scheduler vendor (QStash is a NAS-internal detail).
INVARIANT re-verified: zero qstash/upstash hits across plugins/cron, gateway,
hermes_cli, tools, website/docs (the one remaining repo hit is an unrelated
Context7 MCP comment in tools/mcp_tool.py).
Tests: test_jobs_changed_notify (5) — notify calls provider hook, swallows
errors, built-in harmless, tool create/remove notify. Full cron + chronos +
webhook + config + api_server_jobs suites green (504 in the cron+chronos+webhook
run).
Phase 4C. claim_job_for_fire(job_id, *, claim_ttl_seconds=300) in cron/jobs.py:
under the existing _jobs_lock() file lock, claim a job for a single external
fire so that across N gateway replicas exactly ONE wins. Single-machine
deployments always win (unaffected).
Semantics:
- missing / disabled / paused job → False.
- a fresh fire_claim (younger than claim_ttl_seconds) already present → False
(someone else holds it). Stale claim (crashed winner) → overwrite, so a job
is never wedged forever.
- on win: stamp fire_claim={at, by:_machine_id()}; for recurring (cron/interval)
advance next_run_at (mirrors advance_next_run's at-most-once bump so a stale
re-delivery can't re-fire); one-shots keep next_run_at but the fresh claim
blocks a duplicate retry for the same fire.
- mark_job_run now clears fire_claim on completion so a re-armed recurring job
is claimable again next fire.
_machine_id() (HERMES_MACHINE_ID env, else hostname:pid) is attribution-only;
correctness is the file lock + fresh-claim check, not the id.
This is consumed by CronScheduler.fire_due (Phase 4B). tick is untouched — it
still uses advance_next_run, so the built-in single-machine path is unaffected.
Tests (real store, temp HERMES_HOME): claim-once-then-block + next_run advance,
one-shot no-double-claim, unknown→False, paused→False, stale-claim reclaimable,
mark_job_run clears the claim (recurring re-claimable). tests/cron/ 470 passed.
Phase 4B. Three NON-abstract hooks on the CronScheduler ABC, all with
built-in-safe defaults so the built-in inherits them without overriding and
test_abc_growth_stays_additive stays green (required surface still {name,
start}):
- on_jobs_changed(): post-mutation reconcile hook. Built-in no-op.
- fire_due(job_id): claim the job via the store CAS (claim_job_for_fire,
Phase 4C) then run it through the shared run_one_job (Phase 4A). Returns
False if the claim is lost or the job vanished (repeat-N exhausted between
arm and fire). The inbound webhook (Phase 4E) routes here.
- reconcile(): converge the external registry toward jobs.json. Built-in no-op.
fire_due imports claim_job_for_fire/get_job/run_one_job INSIDE the method, so
this commits cleanly before Phase 4C lands claim_job_for_fire (import-time is
unaffected; tests monkeypatch it with raising=False).
Tests: required-surface-unchanged guard, built-in inherits no-op defaults, and
fire_due's three paths (claim+run, lost-claim→no-run, missing-job→no-run).
tests/cron/ green (20 in test_scheduler_provider.py).
Phase 4A. Factor tick's per-job closure (_process_job: execute → save →
deliver → mark) into a module-level run_one_job(job, *, adapters, loop,
verbose) so the external Chronos provider's fire_due (Phase 4D) reuses the
IDENTICAL body — no duplicated correctness. tick's _process_job is now a thin
wrapper calling run_one_job; the pool/in-flight-guard/contextvars dispatch
logic is unchanged.
run_one_job fires ONE given job; it does NOT decide due-ness, claim, or compute
next_run (tick advances next_run_at under the file lock; an external provider
claims via the store CAS in Phase 4C). Pure refactor, no behavior change.
TDD: test_run_one_job.py characterizes the sequence through tick() first
(test_tick_process_job_sequence, passed pre-extraction), then unit-tests the
helper directly: success sequence, [SILENT]→skip delivery, empty-response soft
failure (#8585), failed-job-still-delivers, exception→mark-failed.
Verified: tests/cron/ 459 passed (was 453 + 6 new); tick behavior unchanged.
Phase 2 of the pluggable cron-scheduler refactor. Still no call-site changes;
this wires up provider SELECTION with a hard safety net.
Task 2.1: cron.provider config key (hermes_cli/config.py), empty = built-in.
Additive key — deep-merge picks it up into existing configs with no version
bump (verified: load_config() yields the key on a pre-existing config.yaml).
Task 2.2: plugins/cron/__init__.py — discovery machinery cloned near-verbatim
from plugins/memory/__init__.py, retargeted at CronScheduler /
register_cron_scheduler. Bundled (plugins/cron/<name>/) + user
(/plugins/<name>/) dirs, bundled wins collisions. The built-in is
NOT discovered here — it's core, so the fallback can't be removed.
Task 2.3: resolve_cron_scheduler() in cron/scheduler_provider.py — reads
cron.provider and ALWAYS degrades to built-in (missing / unavailable / load
error / typo all fall back with a warning). cron can never be left without a
trigger.
Deviation from plan: the plan's resolver snippet used cfg_get("cron.provider")
(dotted-string form). The real cfg_get signature is cfg_get(cfg, *keys,
default=) — corrected to cfg_get(load_config(), "cron", "provider", default=""),
matching plugins/memory/__init__.py:349. Tests monkeypatch load_config (not
cfg_get) so the real traversal runs.
Tests: default key empty, discovery returns list, unknown load returns None,
and the four resolver paths (empty→builtin, no-section→builtin,
unknown→builtin, unavailable→builtin, available→used). Full tests/cron/: 453
passed; config suite green (additive key, no migration break).
Phase 1 of the pluggable cron-scheduler refactor (Axis B — the trigger).
No call-site changes; this phase only makes the abstraction exist + tested
in isolation.
Task 1.1: cron/scheduler_provider.py — the EXPERIMENTAL CronScheduler ABC.
Required surface is name + start; is_available()/stop() carry safe defaults.
is_available has a no-network invariant. Docstring marks it experimental
until the Chronos provider (Phase 4) validates the shape.
Task 1.2: InProcessCronScheduler wraps the historical 60s ticker loop, calling
cron.scheduler.tick(sync=False) exactly as the raw ticker does. Uses
stop_event.wait(interval) for responsive stop (both raw tickers already do).
Tests: ABC-is-abstract, default-is_available, the InProcess loop drives tick
and stops, stop() no-op, and test_abc_growth_stays_additive (the forward-compat
guard: required abstractmethods must stay exactly {name, start}, so the three
Phase-4 hooks land as NON-abstract additions).
tick() internals in cron/scheduler.py are byte-unchanged (only new file added).
Phase 0 characterization tests still green. Full tests/cron/: 445 passed.
Route curator rollback through the same cross-process cron job lock, make save_jobs lock for legacy direct callers without deadlocking nested mutation paths, and harden the regression test so a second _jobs_lock caller really blocks across processes.
`hermes cron pause`/`resume`/`remove` run in their own CLI process (CLI →
cronjob tool → pause_job → update_job → save_jobs), entirely separate from
the gateway process that also writes jobs.json (mark_job_run, advance_next_run,
due-fast-forward in get_due_jobs). The only synchronization was a module-level
`threading.Lock`, which serializes writers *within a single process* but does
nothing across processes — and update_job/pause_job/remove_job/create_job did
not even take it.
The result is a classic lost update: a `cron pause` issued while the gateway is
live loads jobs.json, sets enabled=False, and saves; concurrently the gateway
loads the same file and saves back its run-bookkeeping, clobbering the pause.
The CLI prints "Paused" (it succeeded against its own in-memory copy) but the
job stays enabled and keeps firing, with no error surfaced. The scheduler's
`.tick.lock` flock can't be reused for this — it is held for the entire tick,
including multi-minute agent runs, so a CLI mutation would block for minutes.
Add `_jobs_lock()`: a short-held cross-process advisory file lock (fcntl/msvcrt
flock on `<hermes_home>/cron/.jobs.lock`) layered over the existing in-process
lock, and wrap every load→modify→save critical section with it — create_job,
update_job, remove_job, mark_job_run, advance_next_run, get_due_jobs,
rewrite_skill_refs. The lock degrades to in-process-only if neither fcntl nor
msvcrt is available, preserving prior behaviour. All critical sections are short
(field edits, no agent execution), so contention resolves in milliseconds.
Adds a regression test that proves the lock excludes a second process (an
in-process threading.Lock cannot).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs: finish Automation Blueprints terminology rebrand
Replace leftover "Automation Templates" wording from the Cron Recipes
rebrand, rename the copy-paste cookbook guide to Automation Recipes, and
point the marketing gallery link at the blueprints catalog.
Co-authored-by: Cursor <cursoragent@cursor.com>
* docs: use Automation Blueprints instead of Recipes in guide
Rename the cookbook guide from automation-recipes to
automation-blueprints so sidebar and copy match the product term.
Co-authored-by: Cursor <cursoragent@cursor.com>
* docs: rename automation-blueprints-catalog to automation-blueprints
Drop the -catalog suffix from the reference page slug and title, and
move the copy-paste cookbook to automation-blueprint-examples so the
main Automation Blueprints doc is unambiguous.
Co-authored-by: Cursor <cursoragent@cursor.com>
* Revert "docs: rename automation-blueprints-catalog to automation-blueprints"
This reverts commit 605f1eeab5.
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
Review fixes for the Cron Recipes stack before release:
- hydration-move: */90 in the cron minute field silently wraps to hourly
(croniter-verified) — 90/120-minute options never fired at their stated
cadence. Replaced with an hour-field step (0 9-17/2 * * 1-5) and an
interval_hours slot whose options (1/2/3h) all fire as labeled.
- fill_recipe: reject unknown slot names. A typo'd 'tiem=07:15' used to
silently create the job at the 08:00 default; now it 422s on the dashboard
form and errors on the slash/deep-link paths with the valid slot list.
- deliver slot: non-strict enum (options are suggestions, scheduler
validates downstream) so slack/whatsapp/etc. users aren't locked out;
GET /api/cron/recipes rewrites its options from cron_delivery_targets()
so the dashboard form only offers configured platforms; help text no
longer claims dashboard-created jobs deliver to 'the chat you set this
up from' (the endpoint strips origin — they go to the home channel).
- gateway: success/accept messages no longer point at /cron (cli_only);
surface-aware hint instead. Conversational fill now sends the
'Setting up X — I'll ask you a couple of things…' ack before the agent
turn, matching the CLI experience.
- important-mail catalog entry: reference the urgency classifier by module
path (python3 -m cron.scripts.classify_items) instead of baking an
absolute host path into the job prompt — stale after relocation and
nonexistent on remote terminal backends. cron/scripts is now a real
package and ships in the wheel (pyproject packages.find).
- export_recipe: interval schedules round-trip again — parse_schedule
stores 'minutes' but the renderer only read 'seconds', so every interval
job exported as the silent '0 9 * * *' fallback.
- skills_hub install: say so when a recipe suggestion is dropped
(latched dedup or pending cap) instead of printing nothing.
Targeted tests: 58 cron/recipe + 261 web_server pass; E2E-validated all
14 recipes fill+parse, hydration cadences via croniter, typo rejection on
slash + endpoint paths, surface-aware hints, and interval export round-trip.
Reworks the chat-line UX: pick a recipe by name and the agent asks you for
what it needs, one question at a time, instead of forcing you to hand-type a
slot=val command line.
- /cron-recipe -> lists the catalog
- /cron-recipe <name> -> forgiving name match (exact/prefix/substring/
fuzzy; ambiguous lists candidates), then seeds
the agent with a natural-language fill request
built from the recipe's typed slots + schedule
and prompt templates. The agent asks for each
value one at a time and calls the EXISTING
cronjob tool. No new tool.
- /cron-recipe <name> slot=val -> unchanged deterministic path (fill_recipe ->
create_job) for the dashboard/docs/power user.
Mechanism (no new plumbing, invariant-safe — the seed enters as a normal user
turn, never a synthetic injection):
- shared handler returns RecipeCommandResult{text, agent_seed}; match_recipe()
and build_recipe_seed() are the new shared pieces.
- gateway: dispatch rewrites event.text to the seed and falls through to the
agent (the same pattern /steer uses).
- CLI: handler sets a one-shot self._pending_agent_seed; the interactive loop
consumes it right after process_command() and runs it as the next turn.
The typed-slot schema stays the single source of truth (still validates the
form/inline path via fill_recipe); the agent path just renders those slots into
the questions to ask. Docs updated to lead with the name-then-ask flow.
A 'recipe' is a one-place definition of an automation that every surface
renders natively. The slot schema (cron/recipe_catalog.py) is the single
source of truth; four renderers consume it, and all paths end at the same
cron.jobs.create_job — no second job engine.
Form where there's a screen, conversation where there's a chat line:
- Dashboard / GUI app: a Recipes sub-tab on the Cron page renders each
recipe's typed slots as a form (time-picker, enum dropdown, free-text);
submit POSTs /api/cron/recipes/instantiate which fills + creates the job.
- CLI / TUI / messengers: /cron-recipe lists the catalog, shows a recipe's
fields, or fills + creates from a pasted 'key slot=val' command. The shared
handler (hermes_cli/cron_recipe_cmd.py) names any missing/invalid slot so
the agent can ask a targeted follow-up.
- Docs: a generated Cron Recipes catalog page (website, .mdx + React cards)
shows each recipe with a copy-paste command and a 'Send to App' button.
- Desktop: a hermes:// URL scheme (Electron single-instance lock +
setAsDefaultProtocolClient + open-url/second-instance) routes
hermes://cron-recipe/<key>?slot=val into the chat composer pre-filled.
Typed slots (time/enum/text/weekdays) with defaults: users never type raw
cron — recipes parameterize time-of-day and weekday sets and translate to
cron expressions; a free-text 'schedule' slot is the full-flexibility escape
hatch. Consent-first throughout: nothing schedules without an explicit submit
or send.
Core:
- cron/recipe_catalog.py — CronRecipe + RecipeSlot, 5 curated recipes,
recipe_form_schema / recipe_slash_command / recipe_deeplink /
recipe_catalog_entry renderers, fill_recipe (validate + translate to
create_job kwargs).
- hermes_cli/cron_recipe_cmd.py — shared /cron-recipe handler (CLI + TUI +
gateway never drift). CommandDef + dispatch in commands.py / cli.py /
gateway/run.py.
Dashboard: GET /api/cron/recipes + POST /api/cron/recipes/instantiate
(web_server.py), CronRecipes.tsx gallery+form, Segmented sub-tab on CronPage,
api.ts methods + types.
Desktop: hermes:// scheme end to end (main.cjs deep-link router + ready-queue,
preload onDeepLink/signalDeepLinkReady, global.d.ts types, desktop-controller
composer prefill, electron-builder protocols key).
Docs: extract-cron-recipes.py generator wired into prebuild.mjs,
cron-recipes-catalog.mdx + CronRecipesCatalog React component, sidebar entry.
Generated index json gitignored like skills.json.
Tests: 23 core (catalog/slots/schedule-resolution/validation/renderers/command
handler/generator) + 5 web_server endpoint tests. E2E verified end to end:
slot fill -> create_job -> persisted job with correct schedule/deliver/origin.
Hermes can propose automations and let the user accept them with one tap
via /suggestions, instead of making them assemble cron jobs by hand. Every
proposal — wherever it originates — flows through one surface.
Sources (the 'where suggestions come from'):
- catalog: curated starter automations (daily briefing, important-mail
monitor, weekly review, workday-start reminder) via /suggestions catalog
- recipe: installing a skill that carries a metadata.hermes.recipe block
registers a suggestion instead of auto-scheduling
- usage / integration: reserved for the background-review detector and
account-connect triggers (sources defined; emitters land next)
Pieces:
- cron/suggestions.py — the store. add/list/accept/dismiss, dedup+latch by
key (dismissed proposals never re-offered), pending cap so it can't become
a nag wall. Accepting calls the existing cron.jobs.create_job — there is
NO second job engine. Mirrors jobs.py storage (atomic writes, lock, 0600).
- cron/suggestion_catalog.py — the curated set. The important-mail monitor
entry is where the old proactive-monitor poll->classify->surface engine
lives now (cron/scripts/classify_items.py + the 'monitor' aux task), as ONE
catalog automation rather than a standalone feature.
- tools/recipes.py — recipe<->job bridge; register_recipe_suggestion() makes
a recipe source 'recipe' of this surface. recipe_to_job_spec() is the single
translation both the direct and suggestion paths share.
- hermes_cli/suggestions_cmd.py — shared /suggestions handler (CLI + gateway
never drift); /suggestions [accept N|dismiss N|catalog|clear].
- Wired: CommandDef + CLI dispatch (cli.py) + gateway dispatch (gateway/run.py)
+ aux 'monitor' task (config.py) + recipe-install hook (skills_hub.py).
Consent-first throughout: nothing auto-schedules; acceptance is always
explicit; dismissals latch.
Supersedes #41122 (proactive-monitor) and #41127 (recipes): both fold in here
as a catalog entry and a suggestion source respectively.
Tests: store (dedup/cap/accept/dismiss/latch), catalog seeding+idempotency,
recipe->suggestion bridge, command handler, aux config. E2E: recipe SKILL.md
-> parsed -> suggested -> accepted -> real cron job persisted to jobs.json.
Fully removes the cron per-job 'profile' arg added in #28124: the
cronjob tool schema field, CLI --profile flags on cron create/edit,
job-record storage/validation, the scheduler's _job_profile_context
wrapper, and the script-runner env override. Sequential-partition
logic reverts to workdir-only.
The context-local HERMES_HOME override in hermes_constants and the
subprocess bridging in tools/environments/local.py are kept — they
now have other consumers (dashboard multi-profile, TUI gateway).
The runtime assembled-prompt scan (#3968 lineage) selected its pattern
tier on has_skills alone. A script-driven, no-skills job injects its
script's stdout into the prompt, and that blob was scanned with the
STRICT user-prompt pattern set — so any command-shape string in the
data feed (e.g. a triage bot ingesting a bug report that quotes
`rm -rf /`) hard-blocked the job on every tick.
Script output and context_from output are runtime DATA produced by
operator-authored code — the same trust class as install-vetted skill
markdown, not a user-authored directive prompt. Select the scan tier by
what the assembled prompt CONTAINS: when it includes skill content OR
injected data, use the looser _scan_cron_skill_assembled set (keeps
unambiguous injection directives, drops command-shape patterns,
sanitizes invisible unicode instead of blocking).
Defense-in-depth is preserved:
- The raw user prompt is still strict-scanned at create/update
(api_server paths untouched) AND re-scanned strict at runtime even
when the looser tier was selected for the data blob.
- Plain no-script/no-skills jobs keep the strict scan on the whole
assembled prompt.
- Injection directives arriving via script stdout still block.
Rejected alternative: removing destructive_root_rm from the strict set
or a per-job skip_injection_scan flag — both weaken the guard globally.
A cron session's first message is the injected "[IMPORTANT: you are running as
a scheduled cron job …]" delivery hint, so with no explicit title the sidebar
and history rows fell back to that hint as their label.
Set the session title from the job (name → short prompt → id) with a run-time
suffix for uniqueness against the sessions.title index. Done after the run so
the agent's own INSERT keeps model/system_prompt — this only updates the title.
* fix: respect disabled auto-compaction on context overflow
Port from anomalyco/opencode#30749.
When compression.enabled is false, NO automatic compaction trigger may
fire. The proactive token-threshold paths (preflight + post-response
should_compress gate) already honoured the setting, but the three
provider-overflow recovery paths in the agent loop — long-context-tier
429, 413 payload-too-large, and context-overflow — called
_compress_context() unconditionally, silently compressing and rotating
the session against the user's explicit choice.
Add a single guard at the top of the overflow-recovery dispatch: when
compression is disabled and the error is one of those three overflow
classes, surface a terminal error (compaction_disabled: True) telling the
user to /compress manually, /new, switch to a larger-context model, or
reduce attachments. Manual /compress (force=True) is unaffected — it never
enters this loop.
Tests: new TestOverflowWithCompactionDisabled (413 + 400 overflow don't
compress when disabled; control case still compresses when enabled).
Existing overflow-recovery tests updated to enable compaction explicitly
(they verify the recovery fires); fixture defaults flipped to True to
match production (compression.enabled defaults to True).
* fix(dashboard): populate cron delivery dropdown from configured platforms
The dashboard cron-create/edit dropdown hardcoded five delivery options
(local, telegram, discord, slack, email), so users on Matrix — or any
other backend-supported platform — had no way to pick their channel even
though the cron scheduler delivers to all of them. It also offered
Telegram/Discord/etc. to users who never set those up.
- cron/scheduler.py: add cron_delivery_targets() — the single source of
truth. Intersects gateway-configured platforms with cron-deliverable
ones and reports whether each platform's home channel is set.
- web_server.py: GET /api/cron/delivery-targets exposes that list (+ the
implicit local option) to the dashboard.
- CronPage.tsx: both modals render options from the endpoint. Configured
platforms missing a home channel still appear, annotated "set a home
channel first" (option B), so the user knows what to fix. Edit modal
preserves a job's current target even if it's no longer configured.
Local-only state shows a "configure a platform under Channels" hint.
Validation: scheduler + endpoint E2E'd with a Matrix gateway (home set
and unset); 5 new tests; tests/cron + tests/hermes_cli/test_web_server
green (366 passed).
Follow-up on the parallel-dispatch decoupling: the sequential pass for
workdir/profile jobs still ran inline in the ticker thread, so a long
workdir/profile job reintroduced the exact starvation #37312 describes,
just for env-mutating jobs. And the MCP orphan sweep ran immediately
after dispatch in sync=False mode — before jobs finished — defeating its
own 'runs after every job' contract and racing jobs still spawning MCP
children.
- Sequential jobs now queue to a persistent single-thread cron-seq pool
(preserves one-at-a-time ordering across ticks, never blocks the tick).
- Same in-flight dedup guard now covers sequential jobs.
- MCP orphan sweep runs via a done-callback after the LAST dispatched job
completes in async mode; inline after as_completed in sync mode.
Verified E2E: tick(sync=False) returns in ~1ms with a 1.5s sequential job
in flight; sweep fires only after that job ends.
PR #13021 fixed serial starvation by adding ThreadPoolExecutor to tick(),
but kept as_completed(timeout=600) which still blocks the ticker thread
until the slowest job finishes. This causes the same starvation pattern:
when one job runs long (15+ min), other jobs' next_run_at expires past the
grace window and they get perpetually fast-forwarded instead of running.
This PR decouples dispatch from completion:
- Persistent ThreadPoolExecutor (reused across ticks, no auto-join)
- Fire-and-forget dispatch: tick submits and returns immediately
- Running-job guard: prevents re-dispatching active jobs
- sync parameter: defaults to True (backward compatible), callers opt
into sync=False for non-blocking behavior
- atexit shutdown handler for clean pool teardown
- gateway/run.py: production ticker opts into sync=False
Refs #33315 (complementary — that issue's PRs fix grace handling in
jobs.py; this PR prevents the grace from expiring in the first place)
A stray zero-width space (U+200B), BOM, or bidi control in loaded skill
markdown permanently killed any cron that loaded it. The skills-attached
assembled-prompt scan hard-blocked on any invisible-unicode char, even
though skill bodies are already install-time vetted by skills_guard.py and
the chars commonly appear in copy-pasted unicode docs / code examples.
The skills path now strips invisibles (logging the codepoints) and runs the
cleaned prompt. The raw user-prompt path (_scan_cron_prompt) keeps the hard
block — that is the actual #3968 injection surface, where a small directive
prompt with a ZWSP is a smoking gun, not prose. Stripping does not let a real
injection slip through: the directive still matches after sanitization.
_scan_cron_skill_assembled now returns (cleaned_prompt, error).
Salvages 8 distinct fixes from a batch of PRs by @kyssta-exe, reapplied
onto current main (original branches were stale) with a few refinements.
- cron(jobs.py): load_jobs() validates top-level JSON shape — a bare
list auto-repairs into the {"jobs": [...]} dict; scalars/null raise a
clear RuntimeError instead of an uncaught AttributeError that took
down the whole cron subsystem (#37065, closes#36867).
- web(web_server.py): close the per-action log file handle after Popen
so the parent stops leaking one fd per spawned action (#36843).
- web(web_server.py): DELETE /api/env returns 400 for invalid key names
instead of a misleading 500, mirroring PUT /api/env (#36840).
- gateway(gateway.py): read /proc/<pid>/cmdline inside a with-block so
the fd is released immediately instead of relying on GC (#36804).
- web-tools(web_tools.py): include "xai" in check_web_api_key() so a
configured X.AI web backend reports as available (#36802).
- compression(conversation_compression.py): mark the feasibility check
done only after it completes, and default the gate to "not checked"
if the attribute is missing (#36803).
- completion(completion.py): replace `ls` with directory globbing in the
generated bash/zsh/fish profile listers — handles names with spaces
and skips non-directory entries (#36806).
- terminal-tool(terminal_tool.py): drop a duplicate `import threading`
(#36808).
- claw(claw.py): the migrate recommendation now points at the real
`hermes gateway stop` command instead of the non-existent
`hermes stop` (#36795, #36796, closes#36771).
- tests: guard against a leaked HERMES_CRON_SESSION breaking gateway
approval tests — add it to the hermetic conftest unset list (root
cause, protects every test) and pop it in the affected test's
setup_method (#36796).
Co-authored-by: kyssta-exe <kyssta-exe@users.noreply.github.com>
The runtime cron prompt scanner (added in #3968 to plug the
"malicious skill carrying an injection payload" gap) reuses the same
critical-severity patterns as the create-time user-prompt scan against
the *assembled* prompt — which includes loaded skill markdown.
That works fine for narrow patterns like "ignore previous instructions"
which never legitimately appear in prose. It catastrophically false-
positives on command-shape patterns like `cat ~/.hermes/.env`,
`authorized_keys`, `/etc/sudoers`, and `rm -rf /`, which routinely
appear in security postmortems and runbooks as **descriptive prose**
about attacks, not as actual commands.
Concrete failure: the bundled `hermes-agent-dev` skill contains a
security postmortem section saying "the attacker could just
`cat ~/.hermes/.env`". Every PR-scout cron job that loaded this skill
was silently blocked with `Blocked: prompt matches threat pattern
'read_secrets'`. All 11 scout jobs failed for weeks.
Fix: split the scanner into two tiers and route by context:
- `_scan_cron_prompt` (strict, unchanged behavior) runs against
the small user-authored cron prompt at create/update and as a
runtime defense-in-depth when no skills are attached. A legit
user prompt has no business saying `cat .env`, so the strict
patterns still apply there.
- `_scan_cron_skill_assembled` (new, looser) runs against the
assembled prompt when skills are attached. It only catches
unambiguous prompt-injection directives ("ignore previous
instructions", "disregard your rules", "system prompt override",
"do not tell the user") plus invisible-unicode markers. Command-
shape patterns are dropped because they false-positive on prose.
This is defense-in-depth, not the only line of defense. Skill bodies
are already scanned at install time by `skills_guard.py`; the runtime
cron scan exists purely as a tripwire for an obvious injection
directive surviving a malicious install. Catching prose mentions of
commands was never the goal of #3968 — the test that planted a skill
containing `cat ~/.hermes/.env` was the wrong shape of test for the
threat model.
Tests:
- `_scan_cron_prompt` strict behavior preserved (56 existing tests
unchanged: bare `cat .env`, `rm -rf /`, etc. still block).
- New `TestScanCronSkillAssembled` class verifies the looser scanner:
injection / disregard / system-override / do-not-tell-the-user /
invisible-unicode still block; descriptive prose about attack
commands is allowed; GitHub auth-header allowlist still works.
- `test_skill_with_env_exfil_payload_raises` (planted `cat .env`
in skill body) replaced with `test_skill_with_env_exfil_command
_in_prose_is_allowed` documenting the new correct behavior with
the real-world postmortem-style example that triggered the bug.
- All 11 originally-failing PR-scout jobs validated end-to-end via
`_build_job_prompt` — assembled prompts now build successfully
with the `hermes-agent-dev` skill attached.
Total: 75/75 tests in cron + cronjob_tools + threat scanner pass;
544/544 across the wider cron / memory / threat-pattern surface.
Two defense-in-depth fixes on cron output path handling:
1. cron/jobs.py:update_job() rejects mutation of the immutable 'id' field
(raises ValueError). Dashboard PUT /api/cron/jobs/{id} converts this to
HTTP 400. Without this, an attacker who can reach the update endpoint
could rename a job's id to '../escape' and move its output directory
outside OUTPUT_DIR.
2. cron/jobs.py:_job_output_dir() validates job IDs before composing
paths: rejects '.', '..', '/', '\\', absolute paths, and Windows drive
prefixes. Used by save_job_output() and remove_job() so legacy unsafe
IDs (from before this guard) fail closed rather than half-applying a
shutil.rmtree or output write outside the sandbox.
Tests:
- update_job rejects {'id': '../escape'} without renaming
- remove_job(legacy '../escape' id) raises ValueError without deleting
files outside OUTPUT_DIR or removing the job from the store
- save_job_output rejects '..', './escape', 'nested/escape',
absolute paths
- dashboard PUT /api/cron/jobs/{id} with {'id': '../escape'} returns
400, job list unchanged
Salvaged from PR #29826 by @zapabob. Simplified implementation:
- Dropped a 23-line _validate_job_output_id() helper using Path.parts
semantics. The inline check (path separators + dot-components +
is_absolute) is shorter and behaviorally identical.
- Dropped the secondary OUTPUT_DIR.resolve()/relative_to() check —
redundant once we reject any path separator at the input boundary.
- Dropped the _docs/2026-05-21_cron-output-path-hardening_codex.md
planning artifact (we don't check planning docs into the repo).
Co-authored-by: teknium1 <127238744+teknium1@users.noreply.github.com>
The bug: cron/scheduler.py:_resolve_cron_enabled_toolsets returns an
LLM-supplied per-job enabled_toolsets verbatim. The disabled_toolsets
passed to AIAgent was a hardcoded [cronjob, messaging, clarify] that
ignored agent.disabled_toolsets from config.yaml. An LLM could call
cronjob(action='add', enabled_toolsets=['terminal','file'],
prompt='...') and the cron-spawned agent would receive terminal+file
even when the operator had globally disabled them.
Fix: new _resolve_cron_disabled_toolsets() helper that ALWAYS layers
agent.disabled_toolsets on top of the cron baseline. AIAgent's
disabled_toolsets takes precedence over enabled_toolsets, so this
stops the bypass regardless of what the per-job override contains.
This is the disabled-side fix. Three concurrent PRs (#25842, #25815,
#25780) proposed intersection-side variants on _resolve_cron_enabled_toolsets;
this fix is more robust because it stops the leak at the precedence
boundary AIAgent itself enforces, not at a layer above.
Regression test reproduces the issue's PoC exactly:
config.yaml has agent.disabled_toolsets=[terminal,file]; cron job has
enabled_toolsets=[web,terminal,file]; assertion: AIAgent receives
disabled_toolsets containing terminal AND file.
Salvaged from PR #25786 by @Schrotti77. Simplified the implementation:
dropped a 23-line _normalize_toolset_list() helper (handled str/tuple/
set/garbage input shapes) in favor of the existing convention
(agent_cfg.get('disabled_toolsets') or []) used elsewhere in the
codebase. YAML always parses these as lists; the elaborate normalizer
was theatre for shapes we never produce.
Closes#25752
Co-authored-by: teknium1 <127238744+teknium1@users.noreply.github.com>
ntfy now ships as a self-contained plugin under plugins/platforms/ntfy/
instead of editing 8 core files (gateway/config.py Platform enum,
gateway/run.py factory + auth maps, cron/scheduler.py, toolsets.py,
hermes_cli/status.py, agent/prompt_builder.py, gateway/channel_directory.py,
tools/send_message_tool.py).
All routing goes through gateway/platform_registry via register_platform():
- adapter_factory, check_fn, validate_config, is_connected
- env_enablement_fn seeds PlatformConfig.extra from NTFY_* env vars so
gateway status reflects env-only setups without instantiating httpx
- standalone_sender_fn handles deliver=ntfy cron jobs when cron runs
out-of-process from the gateway
- allowed_users_env / allow_all_env hook into _is_user_authorized
- cron_deliver_env_var=NTFY_HOME_CHANNEL for cron home routing
- platform_hint surfaces in the system prompt
- pii_safe=True (topic names are the only identifier; no PII to redact)
Tests moved to tests/gateway/test_ntfy_plugin.py using _plugin_adapter_loader
so the module lives under plugin_adapter_ntfy in sys.modules and cannot
collide with sibling plugin-adapter tests on the same xdist worker. The
core-file grep tests (Platform.NTFY in source, hermes-ntfy in toolsets,
etc.) are replaced with plugin-shape tests covering register() metadata,
env_enablement_fn output, and standalone_sender_fn behavior.
68 tests pass under scripts/run_tests.sh.
Add an official, production-grade WhatsApp integration via Meta's
Business Cloud API as a complement to the existing Baileys bridge.
No bridge subprocess, no QR codes, no account-ban risk — at the cost
of a Meta Business account and a public HTTPS webhook URL.
Setup is fully wizard-driven: 'hermes whatsapp-cloud' walks through
every credential with paste-time validation (catches the #1 trap of
pasting a phone number into the Phone Number ID field), generates a
verify token, and ends with copy-paste instructions for the
cloudflared / Meta-dashboard / Business Manager pieces that can't be
automated. The wizard also points users at Meta's Business Manager
for setting the bot's display name and profile picture.
Feature set:
- Inbound: text, images (with native-vision routing), voice notes
(STT), documents (small text inlined, larger cached), reply context.
- Outbound: text with WhatsApp-flavored markdown conversion, images,
videos, documents, opus voice notes via ffmpeg with MP3 fallback.
- Native interactive buttons for clarify, dangerous-command approval,
and slash-command confirmation flows — matches the Telegram /
Discord UX, graceful degrades to plain text.
- Read receipts (blue double-checkmarks) and typing indicator,
using Meta's combined endpoint so they fire in a single API call.
- Webhook security: X-Hub-Signature-256 HMAC verification (raw body,
constant-time), wamid deduplication, group-shaped-message refusal
(groups deferred to v2 — Baileys still covers them).
- Full integration with the gateway's session, cron, display-tier,
prompt-hint, and auth-allowlist systems. Cloud and Baileys can run
side-by-side against different phone numbers.
Also wires STT (speech-to-text) through Nous's managed audio gateway
for Nous subscribers — previously the default stt.provider=local
required a separate faster-whisper install. New subscribers now get
voice-note transcription out of the box.
Docs: 418-line user guide at website/docs/user-guide/messaging/
whatsapp-cloud.md, sidebar entry, environment-variables reference,
ADDING_A_PLATFORM.md updated with the optional interactive-UX
contract for future adapter authors.
Tests: 100 dedicated tests for the adapter, 32 for the setup wizard,
20 for the Nous subscription STT wiring, plus regression coverage
across display_config, prompt_builder, and the cron scheduler.
Known limitations (deferred until clear demand signal):
- Group chats — use the Baileys bridge if you need them.
- Message templates for 24-hour-window outside-conversation sends —
reactive chat is unaffected; cron / delegate_task with gaps > 24h
will fail with a clear error. The agent's system prompt warns the
model about this so it knows to mention it when scheduling delayed
messages.
Apply CREATE_NO_WINDOW flags when the cron scheduler launches job scripts on Windows so gateway-managed no-agent cron jobs do not flash cmd or python console windows every tick.
1. trajectory_compressor.py: yaml.safe_load() returns None on empty
files, crashing with TypeError on `if 'tokenizer' in data`. Fix by
adding `or {}` fallback. (HIGH — blocks startup with empty config)
2. 6 files with fcntl.flock(LOCK_UN) in finally blocks without
try/except: cron/scheduler.py, hermes_cli/auth.py,
agent/shell_hooks.py, tools/skill_usage.py,
tools/environments/file_sync.py, tools/memory_tool.py. If unlock
raises OSError, fd.close() is skipped and the lock is held forever.
The msvcrt branches already had try/except; the fcntl branches did
not. Fix by wrapping in try/except (OSError, IOError): pass.
3. agent/copilot_acp_client.py line 639: TOCTOU race — path.exists()
followed by path.read_text() with no try/except. If file is deleted
between the check and the read, FileNotFoundError propagates. Fix
by using try/except FileNotFoundError.
4. gateway/sticker_cache.py: non-atomic write via Path.write_text()
can leave truncated JSON on crash, causing JSONDecodeError on next
load. Fix by writing to tempfile + fsync + os.replace (atomic).
When Telegram topic mode is enabled, cron messages delivered to the bot's
root DM (TELEGRAM_HOME_CHANNEL without a thread id) land in the system
lobby — replies there are rebuffed with the lobby reminder and
reply_to_message_id is dropped, so users cannot interact with the cron
output (#24409).
Add an optional TELEGRAM_CRON_THREAD_ID env var that overrides
TELEGRAM_HOME_CHANNEL_THREAD_ID for cron deliveries only. Operators can
create a "Cron" forum topic in the DM, point this var at its thread id,
and replies to cron messages will land in that topic's existing session
instead of the lobby. The home-channel thread id (used elsewhere, e.g.
restart notifications) is unchanged, and explicit
deliver="telegram:chat:thread" targets continue to win over the env var.
Per the reporter's clarification on 2026-05-13, option (a) (cron-side
route to a dedicated topic + config knob) was chosen.
Fixes#24409
Two code paths call json.loads() on output from external tools without
catching JSONDecodeError. If the tool returns a non-JSON string (error
message, empty string, or None), the entire call path crashes.
1. gateway/run.py — text_to_speech_tool() result in voice reply path.
A TTS failure that returns an error string instead of JSON crashes
the voice reply handler, killing the message response entirely.
2. cron/scheduler.py — skill_view() result when loading skills for
cron jobs. A corrupted or missing skill file that returns an error
string instead of JSON crashes the cron tick, preventing all jobs
from executing that cycle.
Both fixes catch (json.JSONDecodeError, TypeError), log a warning,
and gracefully skip the failed operation instead of crashing.
Instead of raising FileNotFoundError (which silently bricks the job),
log a warning and fall back to the scheduler default home. Validates
at create/update time still catches typos. Idea from PR #19958.