* Revert "fix(cron): scope job execution to its owning profile (#32091 follow-up) (#50993)"
This reverts commit 660e36f097.
* Revert "fix(cron): anchor cron storage at the default root home (not the active profile)"
This reverts commit a5c09fd176.
The #32091 fix moved every profile's cron jobs into one shared root store,
but never wired the execution-scoping half it recommended: a job still ran
under whichever profile's ticker picked it up, not its owning profile. So a
job created under `hermes -p donna` could execute with the root profile's
.env / config.yaml / credentials.
- jobs.py: create_job auto-captures the active profile (explicit profile=
override available) and stores it on the job; resolve_profile_home() maps a
profile name to its HERMES_HOME; legacy jobs backfill to 'default'.
- scheduler.py: run_job applies the job's profile via a scoped HERMES_HOME
override (env var + in-process ContextVar) before any .env/config/script
load, restored in finally. tick() routes profile-mismatched jobs to the
single-worker sequential pool so the env mutation can't race.
- cronjob tool threads profile through (NOT exposed in the model schema, to
avoid cross-profile privilege escalation); hermes cron add gains --profile.
E2E verified against a temp HERMES_HOME with a real profile dir: a root-profile
ticker runs a profile='donna' job with HERMES_HOME=donna during execution and
restores the ticker env afterward.
A cron job that sets `enabled_toolsets` to a list of *native* toolsets (e.g.
`["web", "terminal"]`) silently got ZERO MCP tools, while a job with no
per-job list got every globally-enabled MCP server. `_resolve_cron_enabled_
toolsets` returned the per-job list verbatim, bypassing the MCP-merge that the
platform-fallback branch performs via `_get_platform_tools`. So
`discover_mcp_tools()` registered the MCP tools into the registry, but
`get_tool_definitions(enabled_toolsets=...)` kept only the named native
toolsets — the agent then rejected every `mcp_*` call as "Unknown tool". (R2
of #23997.)
Fix: `_merge_mcp_into_per_job_toolsets` layers MCP membership onto a per-job
allowlist with the SAME semantics as `_get_platform_tools`:
* `no_mcp` sentinel present -> no MCP servers (sentinel stripped)
* one or more MCP server names already listed -> treat as an allowlist
* otherwise -> union in every globally-enabled MCP server
To avoid duplicating the "which MCP servers are enabled" computation (it
already existed inline in `_get_platform_tools`), this extracts a shared
`enabled_mcp_server_names(config)` helper in `hermes_cli.tools_config` and has
BOTH the gateway/CLI platform resolver and the cron per-job resolver call it —
so every path agrees on MCP membership (extend, don't duplicate).
Note: the issue's *headline* — bare MCP server names rejected, registry never
includes them — was already fixed on main (commits c10fea8d2 + 04918345e,
both before the issue was filed). This PR closes the remaining cron-specific
gap (R2). The `server:*` / `mcp:server` alias-notation rejection (R1) and the
quiet-mode silent-drop (R3) are tracked separately.
Salvaged from #32788 by sherman-yang (credited below). Reworked to reuse the
shared `enabled_mcp_server_names` helper instead of re-implementing the MCP
membership set in cron/scheduler.py.
Fixes#23997
Co-authored-by: sherman-yang <58446328+sherman-yang@users.noreply.github.com>
`cron/jobs.py` resolved `HERMES_DIR`/`JOBS_FILE` from `get_hermes_home()`,
which follows the active profile override. So a job created from a
profile-scoped agent session (`hermes -p myprofile chat`, where the in-process
`cronjob` tool calls `create_job`) was written to
`~/.hermes/profiles/myprofile/cron/jobs.json`, while the profile-less gateway
(`hermes gateway run`) reads only `~/.hermes/cron/jobs.json`. The job was
silently orphaned: `cronjob action=list` from the same profile reported it
healthy (same file), but the gateway ticker never saw it and it never fired.
`last_run_at` stayed null forever. (#32091)
Fix: resolve the cron store from `get_default_hermes_root()` — the
purpose-built "profile-level operations" root that returns `<root>` even when
`HERMES_HOME` is `<root>/profiles/<name>` (and handles Docker/custom layouts).
Now the creator, the gateway scheduler, and the dashboard all agree on a
single jobs.json at the root, so a job created under any profile is visible to
the gateway.
Scope: this is the storage-location half of the fix. Making a job *execute*
under its originating profile's config/skills (a per-job `profile` field +
runtime context scoping, the #48649 sibling) is a separate, riskier change and
will follow as its own PR — keeping this layer minimal and safe.
Salvaged from #32117 by @mohamedorigami-jpg (authorship preserved). The
comprehensive #33839 (@sweetcornna) takes the same Option-A storage approach
and additionally adds the per-job profile execution scoping; this PR lands the
safe storage layer first.
Tests: `tests/cron/test_cron_profile_storage.py` — asserts the store anchors
at `<root>/cron` under a profile HERMES_HOME (not `<profile>/cron`), and is
unchanged when no profile is active. Full `tests/cron/` suite: 511 passed.
Fixes#32091
Co-authored-by: mohamedorigami-jpg <mohamed.origami@gmail.com>
PR #22410 added three-mode Telegram topic routing to the live message path
(TelegramAdapter.send via the gateway DeliveryRouter), but the cron delivery
path never got it. cron/scheduler.py::_deliver_result sent through the live
adapter with a bare ``{"thread_id": ...}`` and fell back to the standalone
_send_telegram, neither of which addresses Bot API Direct Messages topics
correctly. After Bot API 10.0 (2026-05-08), sending to a private chat with a
bare ``message_thread_id`` is rejected/mis-routed, so cron deliveries to a
private DM topic landed in the General topic instead of the requested lane.
Fix: the cron live-adapter branch now routes the text send through the
gateway's ``DeliveryRouter._deliver_to_platform`` — the same canonical path
live messages use — so it inherits all three Telegram routing modes:
1. Forum/supergroup (negative chat_id) -> message_thread_id
2. Bot API DM topics (private chat_id + numeric topic id) ->
direct_messages_topic_id (the case #22773 reported)
3. Hermes-created named private DM-topic lanes -> ensure_dm_topic +
reply anchor
For mode 2, a private-chat target with a numeric topic id is passed as
``direct_messages_topic_id`` metadata (verified end-to-end:
TelegramAdapter._thread_kwargs_for_send turns it into
``{message_thread_id: None, direct_messages_topic_id: <int>}``), instead of a
bare message_thread_id. Forum/supergroup and home-channel deliveries are
unchanged. The standalone fallback (gateway down) is preserved.
No new config knob and no duplicated routing logic — this reuses the existing
DeliveryRouter rather than reimplementing topic routing in the cron path.
Salvaged from #42051 (stepanov1975) and #23249 (devsart95), which both
diagnosed the missing three-mode routing in the cron/standalone path;
reimplemented onto the canonical DeliveryRouter that landed since those PRs
were opened.
Co-authored-by: Alex <9785479+stepanov1975@users.noreply.github.com>
Co-authored-by: devsart95 <devsart95@gmail.com>
The in-process cron ticker (cron/scheduler_provider.py) caught only
`Exception` and logged at DEBUG, so a `SystemExit`/`KeyboardInterrupt`
raised from a misbehaving provider SDK or agent retry path killed the
ticker thread silently. The gateway PROCESS stayed up, so `hermes cron
status` — which only checks `find_gateway_pids()` — kept reporting
"✓ jobs will fire automatically" while no jobs ever fired (#32612,
#32895).
This makes ticker death survivable and detectable:
- The ticker loop now catches `BaseException` and logs at ERROR with a
traceback, so a single bad tick no longer tears the thread down and
the failure is visible in the gateway log.
- The loop records a heartbeat (`cron/ticker_heartbeat`, epoch seconds)
on startup and after every tick — best-effort, never raised into the
loop. Both ticker entry points (the gateway and the desktop fallback
in web_server.py) funnel through `InProcessCronScheduler.start`, so one
heartbeat site covers both.
- `hermes cron status` now reads the heartbeat age: if the gateway is
running but the heartbeat is stale (> 200s, i.e. several missed ~60s
ticks), it reports the ticker as STALLED and suggests a restart instead
of falsely claiming jobs will fire. A missing heartbeat (older build /
never ran) is treated as "unknown", not "dead".
Adds tests for BaseException survival, per-iteration heartbeat recording,
heartbeat round-trip/age, staleness detection, and silent-write-failure.
Salvaged from #49660 (BaseException survival on current structure),
extended with the heartbeat + honest-status reporting that the earlier
(pre-refactor) watchdog PRs #35616 and #33849 proposed.
Fixes#32612Fixes#32895
Co-authored-by: banditburai <promptsiren@gmail.com>
Co-authored-by: sweetcornna <96944678+sweetcornna@users.noreply.github.com>
Consolidates three cron-delivery defects in cron/scheduler.py::_deliver_result
that all stem from how the live-adapter send result is interpreted.
#38922 — duplicate message on confirmation timeout.
future.result(timeout=60) raising TimeoutError bubbled to the outer
except handler, which left delivered=False, so `if not delivered:` re-sent
the identical message via the standalone path. future.cancel() cannot
un-send a request already in flight on the wire, so a slow confirmation
deterministically produced a duplicate. The send was already dispatched onto
the gateway loop, so a bare timeout is now treated as delivered
(assume-delivered is safer than guaranteed-duplicate) and the standalone
fallback is skipped. The live-adapter media attempt is also skipped on
timeout since the contended loop would re-block each 30s media budget.
#47056 — silent drop when the gateway has an active session.
The old check `if send_result is None or not getattr(send_result,
"success", True)` let a result object missing a `success` attribute default
to True = counted as a successful delivery, so the scheduler logged
"delivered via live adapter" while the gateway never processed the message.
Delivery is now confirmed via _confirm_adapter_delivery(): only an explicit,
truthy `success` attribute counts; None or a `success`-less object falls
through to the standalone path so the message actually arrives.
A genuine send Exception (not a slow confirmation) still falls through to
the standalone path, and is caught by run_job's outer handler — it is
recorded as the job's last_error and never crashes the cron ticker.
#43014 — deliver=origin fails to resolve in CLI sessions.
A CLI-created job has no {platform, chat_id} origin, so deliver=origin (and
auto-detect / deliver=None) was unresolvable and emitted "no delivery target
resolved" on every run. An unresolvable origin with no configured home
channel is now treated as local (output stays in last_output), matching the
documented auto-deliver contract; a concrete unresolvable platform target
still reports a real error.
Salvaged from #41007 (timeout discriminator), folding in #47127's
_confirm_adapter_delivery hardening and #38937 / #43063's origin→local
fallback. Tests rewritten as behavior contracts (timeout => no duplicate;
None / success-less result => standalone fallback; confirmed success => no
fallback; CLI origin => local, explicit platform => still errors).
Co-authored-by: Evi Nova <66773372+Tranquil-Flow@users.noreply.github.com>
Co-authored-by: kyssta-exe <kyssta-exe@users.noreply.github.com>
Cron jobs created without an explicit `model` are stored as `model: null`.
At fire time `run_job` resolved `model = job.get("model") or os.getenv(
"HERMES_MODEL") or ""` and then `_model_cfg.get("default", model)`, so when
config.yaml had no `model.default` (or `model: {default: null}`) an empty
string flowed straight to the provider and surfaced as an opaque HTTP 400
("Model parameter is required" / "model: String should have at least 1
character"). The operator had to inspect jobs.json to discover the job was
stored with a null model.
This change makes cron model resolution robust and symmetric with the CLI:
- Coerce `model: null`/missing config to `{}` so a falsy default never
overwrites an already-resolved env value with `None`.
- Only overwrite `model` from `model.default` when the resolved value is
truthy; accept a `model.model` alias key, mirroring the sibling resolvers
in hermes_cli/oneshot.py, fallback_cmd.py and prompt_size.py.
- Resolve AFTER the managed-scope overlay so an administrator-pinned model
still wins.
- Fail fast with an actionable error (caught by run_job's outer handler and
recorded as the job's last_error — the cron ticker is unaffected) instead
of letting an empty model reach the API.
- The per-job model is re-read every tick, so a `cronjob action=update
model=...` after a failed run takes effect on the next tick (no cache).
Adds tests/cron/conftest.py pinning a default HERMES_MODEL so existing
run_job tests don't trip the new guard, plus regression tests covering env
fallback, config.default fallback, string-form config, the model alias key,
null-default-no-clobber, corrupt-config graceful degradation, fail-fast,
and the no-cache re-read property.
Salvaged from #24005, rebased onto current main, with additional test
coverage folded in from #45550 and the alias-key behavior from #43952.
Fixes#43899Fixes#23979Fixes#22761
Co-authored-by: szzhoujiarui-sketch <szzhoujiarui@gmail.com>
Co-authored-by: rayjun <rayjun0412@gmail.com>
Matches the env= callsite convention at the other sanitized
subprocess spawns (cua_backend dict(os.environ), gateway
os.environ.copy()). Functionally equivalent — _sanitize_subprocess_env
never mutates its input — but avoids handing the live mapping to the
helper.
Follow-up to salvaged PR #49207.
Cron no_agent and pre-check scripts ran with the full gateway/agent
environment, allowing scripts under HERMES_HOME/scripts/ to read provider
credentials. Apply _sanitize_subprocess_env like terminal and MCP paths
(SECURITY.md section 2.3).
Add regression test asserting blocklisted provider vars are absent in the
child process.
The skin bug was one instance of a class: several subsystems build their
config dict directly from config.yaml instead of routing through
hermes_cli.config.load_config (which carries the managed merge), so they
silently ignored administrator-pinned values. Audited every config.yaml
reader and fixed the behavioral-read bypasses:
- gateway/config.py load_gateway_config (messaging gateway: session_reset,
quick_commands, stt, model, ...)
- gateway/run.py _load_gateway_config (its read_raw_config fast path also
skipped the merge — read_raw_config returns raw user YAML)
- tui_gateway/server.py _load_cfg (new TUI + desktop backend: skin,
reasoning_effort, service_tier, provider_routing)
- cron/scheduler.py (scheduled-job model/reasoning/toolsets/provider_routing)
- hermes_logging.py (logging.level/max_size_mb/backup_count)
- hermes_time.py (timezone)
- hermes_cli/doctor.py (memory-provider diagnostic reads effective config)
All route through a new shared managed_scope.apply_managed_overlay() helper
that mirrors _load_config_impl (env-only expansion so a user ${VAR} can't
shadow a managed literal, root-model-string normalization, leaf-merge) and is
fail-open. cli.py's earlier inline fix is refactored onto the same helper.
Write-back paths (slash_commands, telegram/yuanbao dm_topics, profile
distribution) are deliberately left reading raw user YAML — overlaying managed
values there would persist them into the user file. The dashboard
(web_server.py) already routes through load_config and needed no change.
TUI loader caches the RAW config so _save_cfg never writes managed values to
disk. Adds test_managed_scope_overlay.py (helper) and
test_managed_scope_loaders.py (per-surface integration); mutation-checked.
Two small, focused fixes for the cron scheduler and checkpoint manager.
1. _summarize_cron_failure_for_delivery (cron/scheduler.py):
Replaces the raw error dump in _process_job with a compact
pattern-matched summary. Provider rate limits, timeouts, and
authentication errors now produce a short human-readable message
instead of dumping multi-KB provider JSON into the delivery channel.
2. _repair_bare_repo_dirs (tools/checkpoint_manager.py):
Recreates refs/heads/ and branches/ directories after git gc
--prune=now, which can remove empty dirs from bare repos and cause
subsequent git add -A to fail with 'fatal: not a git repository'.
Called after all four git gc call sites.
Both fixes use only standard library imports and plug into existing
call sites with no architectural changes.
Phase 4F (F.1 + F.2 + F.3, agent side). F.4 is the operator-run live smoke
(needs a NAS deployment); recorded in the PR, not code.
F.1 — on_jobs_changed wiring:
- cron/scheduler.py: _notify_provider_jobs_changed() — resolve the active
provider, call on_jobs_changed(), swallow errors. Lives in scheduler.py (not
jobs.py) so the store stays free of provider imports (no import cycle).
- Wired at the consumer surfaces AFTER a successful mutation: the cronjob model
tool (tools/cronjob_tools.py, create/update/remove/pause/resume) — which the
`hermes cron` CLI also routes through — and the REST handlers
(gateway/platforms/api_server.py, same five). Built-in's no-op default = zero
behavior change on the default path. Sleeping-agent direct jobs.json writes
(no tool/CLI/REST) are covered by reconcile-on-wake in start().
F.2 — config: cron.chronos.{portal_url,callback_url,expected_audience,
nas_jwks_url}. All non-secret; the agent holds no scheduler creds and the
outbound provision call reuses the existing Nous token (no token key). Additive
deep-merge key, no version literal.
F.3 — docs:
- docs/chronos-managed-cron-contract.md: authoritative agent↔NAS wire contract
(the three agent-cron endpoints + inbound /api/cron/fire + the 3-hop trust
model + at-most-once/re-arm semantics). This is what the NAS-side agent builds
against.
- cron-internals.md: "Managed cron (Chronos) for scale-to-zero" section.
- cli-commands.md: cron.provider accepts chronos + the cron.chronos.* keys.
- User docs name no scheduler vendor (QStash is a NAS-internal detail).
INVARIANT re-verified: zero qstash/upstash hits across plugins/cron, gateway,
hermes_cli, tools, website/docs (the one remaining repo hit is an unrelated
Context7 MCP comment in tools/mcp_tool.py).
Tests: test_jobs_changed_notify (5) — notify calls provider hook, swallows
errors, built-in harmless, tool create/remove notify. Full cron + chronos +
webhook + config + api_server_jobs suites green (504 in the cron+chronos+webhook
run).
Phase 4A. Factor tick's per-job closure (_process_job: execute → save →
deliver → mark) into a module-level run_one_job(job, *, adapters, loop,
verbose) so the external Chronos provider's fire_due (Phase 4D) reuses the
IDENTICAL body — no duplicated correctness. tick's _process_job is now a thin
wrapper calling run_one_job; the pool/in-flight-guard/contextvars dispatch
logic is unchanged.
run_one_job fires ONE given job; it does NOT decide due-ness, claim, or compute
next_run (tick advances next_run_at under the file lock; an external provider
claims via the store CAS in Phase 4C). Pure refactor, no behavior change.
TDD: test_run_one_job.py characterizes the sequence through tick() first
(test_tick_process_job_sequence, passed pre-extraction), then unit-tests the
helper directly: success sequence, [SILENT]→skip delivery, empty-response soft
failure (#8585), failed-job-still-delivers, exception→mark-failed.
Verified: tests/cron/ 459 passed (was 453 + 6 new); tick behavior unchanged.
Fully removes the cron per-job 'profile' arg added in #28124: the
cronjob tool schema field, CLI --profile flags on cron create/edit,
job-record storage/validation, the scheduler's _job_profile_context
wrapper, and the script-runner env override. Sequential-partition
logic reverts to workdir-only.
The context-local HERMES_HOME override in hermes_constants and the
subprocess bridging in tools/environments/local.py are kept — they
now have other consumers (dashboard multi-profile, TUI gateway).
The runtime assembled-prompt scan (#3968 lineage) selected its pattern
tier on has_skills alone. A script-driven, no-skills job injects its
script's stdout into the prompt, and that blob was scanned with the
STRICT user-prompt pattern set — so any command-shape string in the
data feed (e.g. a triage bot ingesting a bug report that quotes
`rm -rf /`) hard-blocked the job on every tick.
Script output and context_from output are runtime DATA produced by
operator-authored code — the same trust class as install-vetted skill
markdown, not a user-authored directive prompt. Select the scan tier by
what the assembled prompt CONTAINS: when it includes skill content OR
injected data, use the looser _scan_cron_skill_assembled set (keeps
unambiguous injection directives, drops command-shape patterns,
sanitizes invisible unicode instead of blocking).
Defense-in-depth is preserved:
- The raw user prompt is still strict-scanned at create/update
(api_server paths untouched) AND re-scanned strict at runtime even
when the looser tier was selected for the data blob.
- Plain no-script/no-skills jobs keep the strict scan on the whole
assembled prompt.
- Injection directives arriving via script stdout still block.
Rejected alternative: removing destructive_root_rm from the strict set
or a per-job skip_injection_scan flag — both weaken the guard globally.
A cron session's first message is the injected "[IMPORTANT: you are running as
a scheduled cron job …]" delivery hint, so with no explicit title the sidebar
and history rows fell back to that hint as their label.
Set the session title from the job (name → short prompt → id) with a run-time
suffix for uniqueness against the sessions.title index. Done after the run so
the agent's own INSERT keeps model/system_prompt — this only updates the title.
* fix: respect disabled auto-compaction on context overflow
Port from anomalyco/opencode#30749.
When compression.enabled is false, NO automatic compaction trigger may
fire. The proactive token-threshold paths (preflight + post-response
should_compress gate) already honoured the setting, but the three
provider-overflow recovery paths in the agent loop — long-context-tier
429, 413 payload-too-large, and context-overflow — called
_compress_context() unconditionally, silently compressing and rotating
the session against the user's explicit choice.
Add a single guard at the top of the overflow-recovery dispatch: when
compression is disabled and the error is one of those three overflow
classes, surface a terminal error (compaction_disabled: True) telling the
user to /compress manually, /new, switch to a larger-context model, or
reduce attachments. Manual /compress (force=True) is unaffected — it never
enters this loop.
Tests: new TestOverflowWithCompactionDisabled (413 + 400 overflow don't
compress when disabled; control case still compresses when enabled).
Existing overflow-recovery tests updated to enable compaction explicitly
(they verify the recovery fires); fixture defaults flipped to True to
match production (compression.enabled defaults to True).
* fix(dashboard): populate cron delivery dropdown from configured platforms
The dashboard cron-create/edit dropdown hardcoded five delivery options
(local, telegram, discord, slack, email), so users on Matrix — or any
other backend-supported platform — had no way to pick their channel even
though the cron scheduler delivers to all of them. It also offered
Telegram/Discord/etc. to users who never set those up.
- cron/scheduler.py: add cron_delivery_targets() — the single source of
truth. Intersects gateway-configured platforms with cron-deliverable
ones and reports whether each platform's home channel is set.
- web_server.py: GET /api/cron/delivery-targets exposes that list (+ the
implicit local option) to the dashboard.
- CronPage.tsx: both modals render options from the endpoint. Configured
platforms missing a home channel still appear, annotated "set a home
channel first" (option B), so the user knows what to fix. Edit modal
preserves a job's current target even if it's no longer configured.
Local-only state shows a "configure a platform under Channels" hint.
Validation: scheduler + endpoint E2E'd with a Matrix gateway (home set
and unset); 5 new tests; tests/cron + tests/hermes_cli/test_web_server
green (366 passed).
Follow-up on the parallel-dispatch decoupling: the sequential pass for
workdir/profile jobs still ran inline in the ticker thread, so a long
workdir/profile job reintroduced the exact starvation #37312 describes,
just for env-mutating jobs. And the MCP orphan sweep ran immediately
after dispatch in sync=False mode — before jobs finished — defeating its
own 'runs after every job' contract and racing jobs still spawning MCP
children.
- Sequential jobs now queue to a persistent single-thread cron-seq pool
(preserves one-at-a-time ordering across ticks, never blocks the tick).
- Same in-flight dedup guard now covers sequential jobs.
- MCP orphan sweep runs via a done-callback after the LAST dispatched job
completes in async mode; inline after as_completed in sync mode.
Verified E2E: tick(sync=False) returns in ~1ms with a 1.5s sequential job
in flight; sweep fires only after that job ends.
PR #13021 fixed serial starvation by adding ThreadPoolExecutor to tick(),
but kept as_completed(timeout=600) which still blocks the ticker thread
until the slowest job finishes. This causes the same starvation pattern:
when one job runs long (15+ min), other jobs' next_run_at expires past the
grace window and they get perpetually fast-forwarded instead of running.
This PR decouples dispatch from completion:
- Persistent ThreadPoolExecutor (reused across ticks, no auto-join)
- Fire-and-forget dispatch: tick submits and returns immediately
- Running-job guard: prevents re-dispatching active jobs
- sync parameter: defaults to True (backward compatible), callers opt
into sync=False for non-blocking behavior
- atexit shutdown handler for clean pool teardown
- gateway/run.py: production ticker opts into sync=False
Refs #33315 (complementary — that issue's PRs fix grace handling in
jobs.py; this PR prevents the grace from expiring in the first place)
A stray zero-width space (U+200B), BOM, or bidi control in loaded skill
markdown permanently killed any cron that loaded it. The skills-attached
assembled-prompt scan hard-blocked on any invisible-unicode char, even
though skill bodies are already install-time vetted by skills_guard.py and
the chars commonly appear in copy-pasted unicode docs / code examples.
The skills path now strips invisibles (logging the codepoints) and runs the
cleaned prompt. The raw user-prompt path (_scan_cron_prompt) keeps the hard
block — that is the actual #3968 injection surface, where a small directive
prompt with a ZWSP is a smoking gun, not prose. Stripping does not let a real
injection slip through: the directive still matches after sanitization.
_scan_cron_skill_assembled now returns (cleaned_prompt, error).
The runtime cron prompt scanner (added in #3968 to plug the
"malicious skill carrying an injection payload" gap) reuses the same
critical-severity patterns as the create-time user-prompt scan against
the *assembled* prompt — which includes loaded skill markdown.
That works fine for narrow patterns like "ignore previous instructions"
which never legitimately appear in prose. It catastrophically false-
positives on command-shape patterns like `cat ~/.hermes/.env`,
`authorized_keys`, `/etc/sudoers`, and `rm -rf /`, which routinely
appear in security postmortems and runbooks as **descriptive prose**
about attacks, not as actual commands.
Concrete failure: the bundled `hermes-agent-dev` skill contains a
security postmortem section saying "the attacker could just
`cat ~/.hermes/.env`". Every PR-scout cron job that loaded this skill
was silently blocked with `Blocked: prompt matches threat pattern
'read_secrets'`. All 11 scout jobs failed for weeks.
Fix: split the scanner into two tiers and route by context:
- `_scan_cron_prompt` (strict, unchanged behavior) runs against
the small user-authored cron prompt at create/update and as a
runtime defense-in-depth when no skills are attached. A legit
user prompt has no business saying `cat .env`, so the strict
patterns still apply there.
- `_scan_cron_skill_assembled` (new, looser) runs against the
assembled prompt when skills are attached. It only catches
unambiguous prompt-injection directives ("ignore previous
instructions", "disregard your rules", "system prompt override",
"do not tell the user") plus invisible-unicode markers. Command-
shape patterns are dropped because they false-positive on prose.
This is defense-in-depth, not the only line of defense. Skill bodies
are already scanned at install time by `skills_guard.py`; the runtime
cron scan exists purely as a tripwire for an obvious injection
directive surviving a malicious install. Catching prose mentions of
commands was never the goal of #3968 — the test that planted a skill
containing `cat ~/.hermes/.env` was the wrong shape of test for the
threat model.
Tests:
- `_scan_cron_prompt` strict behavior preserved (56 existing tests
unchanged: bare `cat .env`, `rm -rf /`, etc. still block).
- New `TestScanCronSkillAssembled` class verifies the looser scanner:
injection / disregard / system-override / do-not-tell-the-user /
invisible-unicode still block; descriptive prose about attack
commands is allowed; GitHub auth-header allowlist still works.
- `test_skill_with_env_exfil_payload_raises` (planted `cat .env`
in skill body) replaced with `test_skill_with_env_exfil_command
_in_prose_is_allowed` documenting the new correct behavior with
the real-world postmortem-style example that triggered the bug.
- All 11 originally-failing PR-scout jobs validated end-to-end via
`_build_job_prompt` — assembled prompts now build successfully
with the `hermes-agent-dev` skill attached.
Total: 75/75 tests in cron + cronjob_tools + threat scanner pass;
544/544 across the wider cron / memory / threat-pattern surface.
The bug: cron/scheduler.py:_resolve_cron_enabled_toolsets returns an
LLM-supplied per-job enabled_toolsets verbatim. The disabled_toolsets
passed to AIAgent was a hardcoded [cronjob, messaging, clarify] that
ignored agent.disabled_toolsets from config.yaml. An LLM could call
cronjob(action='add', enabled_toolsets=['terminal','file'],
prompt='...') and the cron-spawned agent would receive terminal+file
even when the operator had globally disabled them.
Fix: new _resolve_cron_disabled_toolsets() helper that ALWAYS layers
agent.disabled_toolsets on top of the cron baseline. AIAgent's
disabled_toolsets takes precedence over enabled_toolsets, so this
stops the bypass regardless of what the per-job override contains.
This is the disabled-side fix. Three concurrent PRs (#25842, #25815,
#25780) proposed intersection-side variants on _resolve_cron_enabled_toolsets;
this fix is more robust because it stops the leak at the precedence
boundary AIAgent itself enforces, not at a layer above.
Regression test reproduces the issue's PoC exactly:
config.yaml has agent.disabled_toolsets=[terminal,file]; cron job has
enabled_toolsets=[web,terminal,file]; assertion: AIAgent receives
disabled_toolsets containing terminal AND file.
Salvaged from PR #25786 by @Schrotti77. Simplified the implementation:
dropped a 23-line _normalize_toolset_list() helper (handled str/tuple/
set/garbage input shapes) in favor of the existing convention
(agent_cfg.get('disabled_toolsets') or []) used elsewhere in the
codebase. YAML always parses these as lists; the elaborate normalizer
was theatre for shapes we never produce.
Closes#25752
Co-authored-by: teknium1 <127238744+teknium1@users.noreply.github.com>
ntfy now ships as a self-contained plugin under plugins/platforms/ntfy/
instead of editing 8 core files (gateway/config.py Platform enum,
gateway/run.py factory + auth maps, cron/scheduler.py, toolsets.py,
hermes_cli/status.py, agent/prompt_builder.py, gateway/channel_directory.py,
tools/send_message_tool.py).
All routing goes through gateway/platform_registry via register_platform():
- adapter_factory, check_fn, validate_config, is_connected
- env_enablement_fn seeds PlatformConfig.extra from NTFY_* env vars so
gateway status reflects env-only setups without instantiating httpx
- standalone_sender_fn handles deliver=ntfy cron jobs when cron runs
out-of-process from the gateway
- allowed_users_env / allow_all_env hook into _is_user_authorized
- cron_deliver_env_var=NTFY_HOME_CHANNEL for cron home routing
- platform_hint surfaces in the system prompt
- pii_safe=True (topic names are the only identifier; no PII to redact)
Tests moved to tests/gateway/test_ntfy_plugin.py using _plugin_adapter_loader
so the module lives under plugin_adapter_ntfy in sys.modules and cannot
collide with sibling plugin-adapter tests on the same xdist worker. The
core-file grep tests (Platform.NTFY in source, hermes-ntfy in toolsets,
etc.) are replaced with plugin-shape tests covering register() metadata,
env_enablement_fn output, and standalone_sender_fn behavior.
68 tests pass under scripts/run_tests.sh.
Add an official, production-grade WhatsApp integration via Meta's
Business Cloud API as a complement to the existing Baileys bridge.
No bridge subprocess, no QR codes, no account-ban risk — at the cost
of a Meta Business account and a public HTTPS webhook URL.
Setup is fully wizard-driven: 'hermes whatsapp-cloud' walks through
every credential with paste-time validation (catches the #1 trap of
pasting a phone number into the Phone Number ID field), generates a
verify token, and ends with copy-paste instructions for the
cloudflared / Meta-dashboard / Business Manager pieces that can't be
automated. The wizard also points users at Meta's Business Manager
for setting the bot's display name and profile picture.
Feature set:
- Inbound: text, images (with native-vision routing), voice notes
(STT), documents (small text inlined, larger cached), reply context.
- Outbound: text with WhatsApp-flavored markdown conversion, images,
videos, documents, opus voice notes via ffmpeg with MP3 fallback.
- Native interactive buttons for clarify, dangerous-command approval,
and slash-command confirmation flows — matches the Telegram /
Discord UX, graceful degrades to plain text.
- Read receipts (blue double-checkmarks) and typing indicator,
using Meta's combined endpoint so they fire in a single API call.
- Webhook security: X-Hub-Signature-256 HMAC verification (raw body,
constant-time), wamid deduplication, group-shaped-message refusal
(groups deferred to v2 — Baileys still covers them).
- Full integration with the gateway's session, cron, display-tier,
prompt-hint, and auth-allowlist systems. Cloud and Baileys can run
side-by-side against different phone numbers.
Also wires STT (speech-to-text) through Nous's managed audio gateway
for Nous subscribers — previously the default stt.provider=local
required a separate faster-whisper install. New subscribers now get
voice-note transcription out of the box.
Docs: 418-line user guide at website/docs/user-guide/messaging/
whatsapp-cloud.md, sidebar entry, environment-variables reference,
ADDING_A_PLATFORM.md updated with the optional interactive-UX
contract for future adapter authors.
Tests: 100 dedicated tests for the adapter, 32 for the setup wizard,
20 for the Nous subscription STT wiring, plus regression coverage
across display_config, prompt_builder, and the cron scheduler.
Known limitations (deferred until clear demand signal):
- Group chats — use the Baileys bridge if you need them.
- Message templates for 24-hour-window outside-conversation sends —
reactive chat is unaffected; cron / delegate_task with gaps > 24h
will fail with a clear error. The agent's system prompt warns the
model about this so it knows to mention it when scheduling delayed
messages.
Apply CREATE_NO_WINDOW flags when the cron scheduler launches job scripts on Windows so gateway-managed no-agent cron jobs do not flash cmd or python console windows every tick.
1. trajectory_compressor.py: yaml.safe_load() returns None on empty
files, crashing with TypeError on `if 'tokenizer' in data`. Fix by
adding `or {}` fallback. (HIGH — blocks startup with empty config)
2. 6 files with fcntl.flock(LOCK_UN) in finally blocks without
try/except: cron/scheduler.py, hermes_cli/auth.py,
agent/shell_hooks.py, tools/skill_usage.py,
tools/environments/file_sync.py, tools/memory_tool.py. If unlock
raises OSError, fd.close() is skipped and the lock is held forever.
The msvcrt branches already had try/except; the fcntl branches did
not. Fix by wrapping in try/except (OSError, IOError): pass.
3. agent/copilot_acp_client.py line 639: TOCTOU race — path.exists()
followed by path.read_text() with no try/except. If file is deleted
between the check and the read, FileNotFoundError propagates. Fix
by using try/except FileNotFoundError.
4. gateway/sticker_cache.py: non-atomic write via Path.write_text()
can leave truncated JSON on crash, causing JSONDecodeError on next
load. Fix by writing to tempfile + fsync + os.replace (atomic).
When Telegram topic mode is enabled, cron messages delivered to the bot's
root DM (TELEGRAM_HOME_CHANNEL without a thread id) land in the system
lobby — replies there are rebuffed with the lobby reminder and
reply_to_message_id is dropped, so users cannot interact with the cron
output (#24409).
Add an optional TELEGRAM_CRON_THREAD_ID env var that overrides
TELEGRAM_HOME_CHANNEL_THREAD_ID for cron deliveries only. Operators can
create a "Cron" forum topic in the DM, point this var at its thread id,
and replies to cron messages will land in that topic's existing session
instead of the lobby. The home-channel thread id (used elsewhere, e.g.
restart notifications) is unchanged, and explicit
deliver="telegram:chat:thread" targets continue to win over the env var.
Per the reporter's clarification on 2026-05-13, option (a) (cron-side
route to a dedicated topic + config knob) was chosen.
Fixes#24409
Two code paths call json.loads() on output from external tools without
catching JSONDecodeError. If the tool returns a non-JSON string (error
message, empty string, or None), the entire call path crashes.
1. gateway/run.py — text_to_speech_tool() result in voice reply path.
A TTS failure that returns an error string instead of JSON crashes
the voice reply handler, killing the message response entirely.
2. cron/scheduler.py — skill_view() result when loading skills for
cron jobs. A corrupted or missing skill file that returns an error
string instead of JSON crashes the cron tick, preventing all jobs
from executing that cycle.
Both fixes catch (json.JSONDecodeError, TypeError), log a warning,
and gracefully skip the failed operation instead of crashing.
Instead of raising FileNotFoundError (which silently bricks the job),
log a warning and fall back to the scheduler default home. Validates
at create/update time still catches typos. Idea from PR #19958.
Replace generator-based result collection with explicit per-future
handling. Each future is now processed independently with a 600s timeout.
Before: _results.extend(f.result() for f in _futures)
- One exception stops the generator, remaining results are lost
- No timeout: one hung job blocks the entire tick
After: as_completed() + per-future try/except
- Each future handled independently
- 600s timeout prevents indefinite blocking
- Failed futures are logged and counted as failures