Commit graph

200 commits

Author SHA1 Message Date
Teknium
bb7ff7dc30
revert(cron): return cron job storage to per-profile (reverts #32117 + #50993) (#51116)
* Revert "fix(cron): scope job execution to its owning profile (#32091 follow-up) (#50993)"

This reverts commit 660e36f097.

* Revert "fix(cron): anchor cron storage at the default root home (not the active profile)"

This reverts commit a5c09fd176.
2026-06-22 17:53:50 -07:00
Teknium
660e36f097
fix(cron): scope job execution to its owning profile (#32091 follow-up) (#50993)
The #32091 fix moved every profile's cron jobs into one shared root store,
but never wired the execution-scoping half it recommended: a job still ran
under whichever profile's ticker picked it up, not its owning profile. So a
job created under `hermes -p donna` could execute with the root profile's
.env / config.yaml / credentials.

- jobs.py: create_job auto-captures the active profile (explicit profile=
  override available) and stores it on the job; resolve_profile_home() maps a
  profile name to its HERMES_HOME; legacy jobs backfill to 'default'.
- scheduler.py: run_job applies the job's profile via a scoped HERMES_HOME
  override (env var + in-process ContextVar) before any .env/config/script
  load, restored in finally. tick() routes profile-mismatched jobs to the
  single-worker sequential pool so the env mutation can't race.
- cronjob tool threads profile through (NOT exposed in the model schema, to
  avoid cross-profile privilege escalation); hermes cron add gains --profile.

E2E verified against a temp HERMES_HOME with a real profile dir: a root-profile
ticker runs a profile='donna' job with HERMES_HOME=donna during execution and
restores the ticker env afterward.
2026-06-22 14:54:28 -07:00
helix4u
ae7e857420 fix(cron): deliver max-iteration fallback reports 2026-06-22 13:57:59 -07:00
sherman-yang
74a5905aea fix(cron): layer enabled MCP servers onto per-job enabled_toolsets
A cron job that sets `enabled_toolsets` to a list of *native* toolsets (e.g.
`["web", "terminal"]`) silently got ZERO MCP tools, while a job with no
per-job list got every globally-enabled MCP server. `_resolve_cron_enabled_
toolsets` returned the per-job list verbatim, bypassing the MCP-merge that the
platform-fallback branch performs via `_get_platform_tools`. So
`discover_mcp_tools()` registered the MCP tools into the registry, but
`get_tool_definitions(enabled_toolsets=...)` kept only the named native
toolsets — the agent then rejected every `mcp_*` call as "Unknown tool". (R2
of #23997.)

Fix: `_merge_mcp_into_per_job_toolsets` layers MCP membership onto a per-job
allowlist with the SAME semantics as `_get_platform_tools`:
  * `no_mcp` sentinel present -> no MCP servers (sentinel stripped)
  * one or more MCP server names already listed -> treat as an allowlist
  * otherwise -> union in every globally-enabled MCP server

To avoid duplicating the "which MCP servers are enabled" computation (it
already existed inline in `_get_platform_tools`), this extracts a shared
`enabled_mcp_server_names(config)` helper in `hermes_cli.tools_config` and has
BOTH the gateway/CLI platform resolver and the cron per-job resolver call it —
so every path agrees on MCP membership (extend, don't duplicate).

Note: the issue's *headline* — bare MCP server names rejected, registry never
includes them — was already fixed on main (commits c10fea8d2 + 04918345e,
both before the issue was filed). This PR closes the remaining cron-specific
gap (R2). The `server:*` / `mcp:server` alias-notation rejection (R1) and the
quiet-mode silent-drop (R3) are tracked separately.

Salvaged from #32788 by sherman-yang (credited below). Reworked to reuse the
shared `enabled_mcp_server_names` helper instead of re-implementing the MCP
membership set in cron/scheduler.py.

Fixes #23997

Co-authored-by: sherman-yang <58446328+sherman-yang@users.noreply.github.com>
2026-06-22 15:52:58 +05:30
mohamedorigami-jpg
a5c09fd176 fix(cron): anchor cron storage at the default root home (not the active profile)
`cron/jobs.py` resolved `HERMES_DIR`/`JOBS_FILE` from `get_hermes_home()`,
which follows the active profile override. So a job created from a
profile-scoped agent session (`hermes -p myprofile chat`, where the in-process
`cronjob` tool calls `create_job`) was written to
`~/.hermes/profiles/myprofile/cron/jobs.json`, while the profile-less gateway
(`hermes gateway run`) reads only `~/.hermes/cron/jobs.json`. The job was
silently orphaned: `cronjob action=list` from the same profile reported it
healthy (same file), but the gateway ticker never saw it and it never fired.
`last_run_at` stayed null forever. (#32091)

Fix: resolve the cron store from `get_default_hermes_root()` — the
purpose-built "profile-level operations" root that returns `<root>` even when
`HERMES_HOME` is `<root>/profiles/<name>` (and handles Docker/custom layouts).
Now the creator, the gateway scheduler, and the dashboard all agree on a
single jobs.json at the root, so a job created under any profile is visible to
the gateway.

Scope: this is the storage-location half of the fix. Making a job *execute*
under its originating profile's config/skills (a per-job `profile` field +
runtime context scoping, the #48649 sibling) is a separate, riskier change and
will follow as its own PR — keeping this layer minimal and safe.

Salvaged from #32117 by @mohamedorigami-jpg (authorship preserved). The
comprehensive #33839 (@sweetcornna) takes the same Option-A storage approach
and additionally adds the per-job profile execution scoping; this PR lands the
safe storage layer first.

Tests: `tests/cron/test_cron_profile_storage.py` — asserts the store anchors
at `<root>/cron` under a profile HERMES_HOME (not `<profile>/cron`), and is
unchanged when no profile is active. Full `tests/cron/` suite: 511 passed.

Fixes #32091

Co-authored-by: mohamedorigami-jpg <mohamed.origami@gmail.com>
2026-06-21 16:45:14 +05:30
kshitijk4poor
4cc28aa3bb fix(cron): route Telegram DM-topic cron delivery through DeliveryRouter (#22773)
PR #22410 added three-mode Telegram topic routing to the live message path
(TelegramAdapter.send via the gateway DeliveryRouter), but the cron delivery
path never got it. cron/scheduler.py::_deliver_result sent through the live
adapter with a bare ``{"thread_id": ...}`` and fell back to the standalone
_send_telegram, neither of which addresses Bot API Direct Messages topics
correctly. After Bot API 10.0 (2026-05-08), sending to a private chat with a
bare ``message_thread_id`` is rejected/mis-routed, so cron deliveries to a
private DM topic landed in the General topic instead of the requested lane.

Fix: the cron live-adapter branch now routes the text send through the
gateway's ``DeliveryRouter._deliver_to_platform`` — the same canonical path
live messages use — so it inherits all three Telegram routing modes:

  1. Forum/supergroup (negative chat_id) -> message_thread_id
  2. Bot API DM topics (private chat_id + numeric topic id) ->
     direct_messages_topic_id  (the case #22773 reported)
  3. Hermes-created named private DM-topic lanes -> ensure_dm_topic +
     reply anchor

For mode 2, a private-chat target with a numeric topic id is passed as
``direct_messages_topic_id`` metadata (verified end-to-end:
TelegramAdapter._thread_kwargs_for_send turns it into
``{message_thread_id: None, direct_messages_topic_id: <int>}``), instead of a
bare message_thread_id. Forum/supergroup and home-channel deliveries are
unchanged. The standalone fallback (gateway down) is preserved.

No new config knob and no duplicated routing logic — this reuses the existing
DeliveryRouter rather than reimplementing topic routing in the cron path.

Salvaged from #42051 (stepanov1975) and #23249 (devsart95), which both
diagnosed the missing three-mode routing in the cron/standalone path;
reimplemented onto the canonical DeliveryRouter that landed since those PRs
were opened.

Co-authored-by: Alex <9785479+stepanov1975@users.noreply.github.com>
Co-authored-by: devsart95 <devsart95@gmail.com>
2026-06-21 13:35:45 +05:30
kshitij
02a3288de3
Merge pull request #50018 from NousResearch/salvage/f3a-delivery-confirm
fix(cron): make live-adapter delivery confirmation reliable (#38922, #47056, #43014)
2026-06-21 13:29:45 +05:30
annguyenNous
07424da76f fix(cron): keep ticker alive on BaseException + heartbeat-aware status
The in-process cron ticker (cron/scheduler_provider.py) caught only
`Exception` and logged at DEBUG, so a `SystemExit`/`KeyboardInterrupt`
raised from a misbehaving provider SDK or agent retry path killed the
ticker thread silently. The gateway PROCESS stayed up, so `hermes cron
status` — which only checks `find_gateway_pids()` — kept reporting
"✓ jobs will fire automatically" while no jobs ever fired (#32612,
#32895).

This makes ticker death survivable and detectable:

- The ticker loop now catches `BaseException` and logs at ERROR with a
  traceback, so a single bad tick no longer tears the thread down and
  the failure is visible in the gateway log.
- The loop records a heartbeat (`cron/ticker_heartbeat`, epoch seconds)
  on startup and after every tick — best-effort, never raised into the
  loop. Both ticker entry points (the gateway and the desktop fallback
  in web_server.py) funnel through `InProcessCronScheduler.start`, so one
  heartbeat site covers both.
- `hermes cron status` now reads the heartbeat age: if the gateway is
  running but the heartbeat is stale (> 200s, i.e. several missed ~60s
  ticks), it reports the ticker as STALLED and suggests a restart instead
  of falsely claiming jobs will fire. A missing heartbeat (older build /
  never ran) is treated as "unknown", not "dead".

Adds tests for BaseException survival, per-iteration heartbeat recording,
heartbeat round-trip/age, staleness detection, and silent-write-failure.

Salvaged from #49660 (BaseException survival on current structure),
extended with the heartbeat + honest-status reporting that the earlier
(pre-refactor) watchdog PRs #35616 and #33849 proposed.

Fixes #32612
Fixes #32895

Co-authored-by: banditburai <promptsiren@gmail.com>
Co-authored-by: sweetcornna <96944678+sweetcornna@users.noreply.github.com>
2026-06-21 13:00:50 +05:30
Luke The Dev
d54890870f fix(cron): make live-adapter delivery confirmation reliable (#38922, #47056, #43014)
Consolidates three cron-delivery defects in cron/scheduler.py::_deliver_result
that all stem from how the live-adapter send result is interpreted.

#38922 — duplicate message on confirmation timeout.
  future.result(timeout=60) raising TimeoutError bubbled to the outer
  except handler, which left delivered=False, so `if not delivered:` re-sent
  the identical message via the standalone path. future.cancel() cannot
  un-send a request already in flight on the wire, so a slow confirmation
  deterministically produced a duplicate. The send was already dispatched onto
  the gateway loop, so a bare timeout is now treated as delivered
  (assume-delivered is safer than guaranteed-duplicate) and the standalone
  fallback is skipped. The live-adapter media attempt is also skipped on
  timeout since the contended loop would re-block each 30s media budget.

#47056 — silent drop when the gateway has an active session.
  The old check `if send_result is None or not getattr(send_result,
  "success", True)` let a result object missing a `success` attribute default
  to True = counted as a successful delivery, so the scheduler logged
  "delivered via live adapter" while the gateway never processed the message.
  Delivery is now confirmed via _confirm_adapter_delivery(): only an explicit,
  truthy `success` attribute counts; None or a `success`-less object falls
  through to the standalone path so the message actually arrives.

  A genuine send Exception (not a slow confirmation) still falls through to
  the standalone path, and is caught by run_job's outer handler — it is
  recorded as the job's last_error and never crashes the cron ticker.

#43014 — deliver=origin fails to resolve in CLI sessions.
  A CLI-created job has no {platform, chat_id} origin, so deliver=origin (and
  auto-detect / deliver=None) was unresolvable and emitted "no delivery target
  resolved" on every run. An unresolvable origin with no configured home
  channel is now treated as local (output stays in last_output), matching the
  documented auto-deliver contract; a concrete unresolvable platform target
  still reports a real error.

Salvaged from #41007 (timeout discriminator), folding in #47127's
_confirm_adapter_delivery hardening and #38937 / #43063's origin→local
fallback. Tests rewritten as behavior contracts (timeout => no duplicate;
None / success-less result => standalone fallback; confirmed success => no
fallback; CLI origin => local, explicit platform => still errors).

Co-authored-by: Evi Nova <66773372+Tranquil-Flow@users.noreply.github.com>
Co-authored-by: kyssta-exe <kyssta-exe@users.noreply.github.com>
2026-06-21 12:59:21 +05:30
konsisumer
73b92264ee fix(cron): resolve model.default + fail fast on missing model
Cron jobs created without an explicit `model` are stored as `model: null`.
At fire time `run_job` resolved `model = job.get("model") or os.getenv(
"HERMES_MODEL") or ""` and then `_model_cfg.get("default", model)`, so when
config.yaml had no `model.default` (or `model: {default: null}`) an empty
string flowed straight to the provider and surfaced as an opaque HTTP 400
("Model parameter is required" / "model: String should have at least 1
character"). The operator had to inspect jobs.json to discover the job was
stored with a null model.

This change makes cron model resolution robust and symmetric with the CLI:

- Coerce `model: null`/missing config to `{}` so a falsy default never
  overwrites an already-resolved env value with `None`.
- Only overwrite `model` from `model.default` when the resolved value is
  truthy; accept a `model.model` alias key, mirroring the sibling resolvers
  in hermes_cli/oneshot.py, fallback_cmd.py and prompt_size.py.
- Resolve AFTER the managed-scope overlay so an administrator-pinned model
  still wins.
- Fail fast with an actionable error (caught by run_job's outer handler and
  recorded as the job's last_error — the cron ticker is unaffected) instead
  of letting an empty model reach the API.
- The per-job model is re-read every tick, so a `cronjob action=update
  model=...` after a failed run takes effect on the next tick (no cache).

Adds tests/cron/conftest.py pinning a default HERMES_MODEL so existing
run_job tests don't trip the new guard, plus regression tests covering env
fallback, config.default fallback, string-form config, the model alias key,
null-default-no-clobber, corrupt-config graceful degradation, fail-fast,
and the no-cache re-read property.

Salvaged from #24005, rebased onto current main, with additional test
coverage folded in from #45550 and the alias-key behavior from #43952.

Fixes #43899
Fixes #23979
Fixes #22761

Co-authored-by: szzhoujiarui-sketch <szzhoujiarui@gmail.com>
Co-authored-by: rayjun <rayjun0412@gmail.com>
2026-06-21 12:37:56 +05:30
teknium1
c1a0b6a5f1 style: strip trailing whitespace in cron scheduler live-adapter block
Follow-up on salvaged PR #49280.
2026-06-19 16:59:38 -07:00
joaomarcos
3a6c171e9e fix(gateway): log signal transport response and bubble cron live adapter errors 2026-06-19 16:59:38 -07:00
kshitijk4poor
8dc0b18894 refactor(cron): copy os.environ before sanitizing for subprocess
Matches the env= callsite convention at the other sanitized
subprocess spawns (cua_backend dict(os.environ), gateway
os.environ.copy()). Functionally equivalent — _sanitize_subprocess_env
never mutates its input — but avoids handing the live mapping to the
helper.

Follow-up to salvaged PR #49207.
2026-06-20 00:29:46 +05:30
0z1-ghb
da7253215d fix(cron): sanitize env for job script subprocesses
Cron no_agent and pre-check scripts ran with the full gateway/agent
environment, allowing scripts under HERMES_HOME/scripts/ to read provider
credentials. Apply _sanitize_subprocess_env like terminal and MCP paths
(SECURITY.md section 2.3).

Add regression test asserting blocklisted provider vars are absent in the
child process.
2026-06-20 00:13:11 +05:30
Teknium
2a5e9d994a
Merge pull request #48275 from NousResearch/feat/cron-scheduler-provider-chronos
feat(cron): pluggable CronScheduler interface + Chronos managed-cron provider (scale-to-zero)
2026-06-19 07:51:59 -07:00
Ben
b0e47a98f9 fix(managed-scope): honor managed scope in all standalone config loaders
The skin bug was one instance of a class: several subsystems build their
config dict directly from config.yaml instead of routing through
hermes_cli.config.load_config (which carries the managed merge), so they
silently ignored administrator-pinned values. Audited every config.yaml
reader and fixed the behavioral-read bypasses:

- gateway/config.py load_gateway_config (messaging gateway: session_reset,
  quick_commands, stt, model, ...)
- gateway/run.py _load_gateway_config (its read_raw_config fast path also
  skipped the merge — read_raw_config returns raw user YAML)
- tui_gateway/server.py _load_cfg (new TUI + desktop backend: skin,
  reasoning_effort, service_tier, provider_routing)
- cron/scheduler.py (scheduled-job model/reasoning/toolsets/provider_routing)
- hermes_logging.py (logging.level/max_size_mb/backup_count)
- hermes_time.py (timezone)
- hermes_cli/doctor.py (memory-provider diagnostic reads effective config)

All route through a new shared managed_scope.apply_managed_overlay() helper
that mirrors _load_config_impl (env-only expansion so a user ${VAR} can't
shadow a managed literal, root-model-string normalization, leaf-merge) and is
fail-open. cli.py's earlier inline fix is refactored onto the same helper.

Write-back paths (slash_commands, telegram/yuanbao dm_topics, profile
distribution) are deliberately left reading raw user YAML — overlaying managed
values there would persist them into the user file. The dashboard
(web_server.py) already routes through load_config and needed no change.

TUI loader caches the RAW config so _save_cfg never writes managed values to
disk. Adds test_managed_scope_overlay.py (helper) and
test_managed_scope_loaders.py (per-surface integration); mutation-checked.
2026-06-19 07:46:33 -07:00
teknium1
a58287afcb
Merge remote-tracking branch 'origin/main' into pr48275-rebase
# Conflicts:
#	cron/scheduler.py
2026-06-19 07:40:29 -07:00
Sahil Saghir
a5e06078b2 fix(cron): compact cron failure messages + repair bare repo dirs after git gc
Two small, focused fixes for the cron scheduler and checkpoint manager.

1. _summarize_cron_failure_for_delivery (cron/scheduler.py):
   Replaces the raw error dump in _process_job with a compact
   pattern-matched summary. Provider rate limits, timeouts, and
   authentication errors now produce a short human-readable message
   instead of dumping multi-KB provider JSON into the delivery channel.

2. _repair_bare_repo_dirs (tools/checkpoint_manager.py):
   Recreates refs/heads/ and branches/ directories after git gc
   --prune=now, which can remove empty dirs from bare repos and cause
   subsequent git add -A to fail with 'fatal: not a git repository'.
   Called after all four git gc call sites.

Both fixes use only standard library imports and plug into existing
call sites with no architectural changes.
2026-06-19 07:35:29 -07:00
Ben
b75757d4aa feat(cron): wire on_jobs_changed, cron.chronos config, docs + agent↔NAS contract
Phase 4F (F.1 + F.2 + F.3, agent side). F.4 is the operator-run live smoke
(needs a NAS deployment); recorded in the PR, not code.

F.1 — on_jobs_changed wiring:
- cron/scheduler.py: _notify_provider_jobs_changed() — resolve the active
  provider, call on_jobs_changed(), swallow errors. Lives in scheduler.py (not
  jobs.py) so the store stays free of provider imports (no import cycle).
- Wired at the consumer surfaces AFTER a successful mutation: the cronjob model
  tool (tools/cronjob_tools.py, create/update/remove/pause/resume) — which the
  `hermes cron` CLI also routes through — and the REST handlers
  (gateway/platforms/api_server.py, same five). Built-in's no-op default = zero
  behavior change on the default path. Sleeping-agent direct jobs.json writes
  (no tool/CLI/REST) are covered by reconcile-on-wake in start().

F.2 — config: cron.chronos.{portal_url,callback_url,expected_audience,
nas_jwks_url}. All non-secret; the agent holds no scheduler creds and the
outbound provision call reuses the existing Nous token (no token key). Additive
deep-merge key, no version literal.

F.3 — docs:
- docs/chronos-managed-cron-contract.md: authoritative agent↔NAS wire contract
  (the three agent-cron endpoints + inbound /api/cron/fire + the 3-hop trust
  model + at-most-once/re-arm semantics). This is what the NAS-side agent builds
  against.
- cron-internals.md: "Managed cron (Chronos) for scale-to-zero" section.
- cli-commands.md: cron.provider accepts chronos + the cron.chronos.* keys.
- User docs name no scheduler vendor (QStash is a NAS-internal detail).

INVARIANT re-verified: zero qstash/upstash hits across plugins/cron, gateway,
hermes_cli, tools, website/docs (the one remaining repo hit is an unrelated
Context7 MCP comment in tools/mcp_tool.py).

Tests: test_jobs_changed_notify (5) — notify calls provider hook, swallows
errors, built-in harmless, tool create/remove notify. Full cron + chronos +
webhook + config + api_server_jobs suites green (504 in the cron+chronos+webhook
run).
2026-06-18 15:11:32 +10:00
Ben
58b19a4f69 refactor(cron): extract run_one_job shared firing helper from tick
Phase 4A. Factor tick's per-job closure (_process_job: execute → save →
deliver → mark) into a module-level run_one_job(job, *, adapters, loop,
verbose) so the external Chronos provider's fire_due (Phase 4D) reuses the
IDENTICAL body — no duplicated correctness. tick's _process_job is now a thin
wrapper calling run_one_job; the pool/in-flight-guard/contextvars dispatch
logic is unchanged.

run_one_job fires ONE given job; it does NOT decide due-ness, claim, or compute
next_run (tick advances next_run_at under the file lock; an external provider
claims via the store CAS in Phase 4C). Pure refactor, no behavior change.

TDD: test_run_one_job.py characterizes the sequence through tick() first
(test_tick_process_job_sequence, passed pre-extraction), then unit-tests the
helper directly: success sequence, [SILENT]→skip delivery, empty-response soft
failure (#8585), failed-job-still-delivers, exception→mark-failed.

Verified: tests/cron/ 459 passed (was 453 + 6 new); tick behavior unchanged.
2026-06-18 14:26:29 +10:00
Teknium
2ecb4e62bb
Merge remote-tracking branch 'origin/main' into hermes/hermes-6b48295e 2026-06-11 07:38:25 -07:00
Teknium
7d8d000b19
revert(cron): remove per-job profile support (PR #28124) (#43956)
Fully removes the cron per-job 'profile' arg added in #28124: the
cronjob tool schema field, CLI --profile flags on cron create/edit,
job-record storage/validation, the scheduler's _job_profile_context
wrapper, and the script-runner env override. Sequential-partition
logic reverts to workdir-only.

The context-local HERMES_HOME override in hermes_constants and the
subprocess bridging in tools/environments/local.py are kept — they
now have other consumers (dashboard multi-profile, TUI gateway).
2026-06-10 20:46:17 -07:00
emozilla
bfcc9f92b4 Merge commit '6110aed9b' into feat/whatsapp-cloud-api 2026-06-10 21:39:22 -04:00
Siddharth Balyan
b4170f3ac2
fix(cron): don't strict-scan script-injected output in no-skills jobs (#43223)
The runtime assembled-prompt scan (#3968 lineage) selected its pattern
tier on has_skills alone. A script-driven, no-skills job injects its
script's stdout into the prompt, and that blob was scanned with the
STRICT user-prompt pattern set — so any command-shape string in the
data feed (e.g. a triage bot ingesting a bug report that quotes
`rm -rf /`) hard-blocked the job on every tick.

Script output and context_from output are runtime DATA produced by
operator-authored code — the same trust class as install-vetted skill
markdown, not a user-authored directive prompt. Select the scan tier by
what the assembled prompt CONTAINS: when it includes skill content OR
injected data, use the looser _scan_cron_skill_assembled set (keeps
unambiguous injection directives, drops command-shape patterns,
sanitizes invisible unicode instead of blocking).

Defense-in-depth is preserved:
- The raw user prompt is still strict-scanned at create/update
  (api_server paths untouched) AND re-scanned strict at runtime even
  when the looser tier was selected for the data blob.
- Plain no-script/no-skills jobs keep the strict scan on the whole
  assembled prompt.
- Injection directives arriving via script stdout still block.

Rejected alternative: removing destructive_root_rm from the strict set
or a per-job skip_injection_scan flag — both weaken the guard globally.
2026-06-10 08:27:24 +05:30
Brooklyn Nicholson
ad0f6db151 feat(cron): title cron sessions from the job, not the [IMPORTANT] hint
A cron session's first message is the injected "[IMPORTANT: you are running as
a scheduled cron job …]" delivery hint, so with no explicit title the sidebar
and history rows fell back to that hint as their label.

Set the session title from the job (name → short prompt → id) with a run-time
suffix for uniqueness against the sessions.title index. Done after the run so
the agent's own INSERT keeps model/system_prompt — this only updates the title.
2026-06-06 12:51:12 -05:00
Teknium
50f9ad70fc
fix(dashboard): populate cron delivery dropdown from configured platforms (#40218)
* fix: respect disabled auto-compaction on context overflow

Port from anomalyco/opencode#30749.

When compression.enabled is false, NO automatic compaction trigger may
fire. The proactive token-threshold paths (preflight + post-response
should_compress gate) already honoured the setting, but the three
provider-overflow recovery paths in the agent loop — long-context-tier
429, 413 payload-too-large, and context-overflow — called
_compress_context() unconditionally, silently compressing and rotating
the session against the user's explicit choice.

Add a single guard at the top of the overflow-recovery dispatch: when
compression is disabled and the error is one of those three overflow
classes, surface a terminal error (compaction_disabled: True) telling the
user to /compress manually, /new, switch to a larger-context model, or
reduce attachments. Manual /compress (force=True) is unaffected — it never
enters this loop.

Tests: new TestOverflowWithCompactionDisabled (413 + 400 overflow don't
compress when disabled; control case still compresses when enabled).
Existing overflow-recovery tests updated to enable compaction explicitly
(they verify the recovery fires); fixture defaults flipped to True to
match production (compression.enabled defaults to True).

* fix(dashboard): populate cron delivery dropdown from configured platforms

The dashboard cron-create/edit dropdown hardcoded five delivery options
(local, telegram, discord, slack, email), so users on Matrix — or any
other backend-supported platform — had no way to pick their channel even
though the cron scheduler delivers to all of them. It also offered
Telegram/Discord/etc. to users who never set those up.

- cron/scheduler.py: add cron_delivery_targets() — the single source of
  truth. Intersects gateway-configured platforms with cron-deliverable
  ones and reports whether each platform's home channel is set.
- web_server.py: GET /api/cron/delivery-targets exposes that list (+ the
  implicit local option) to the dashboard.
- CronPage.tsx: both modals render options from the endpoint. Configured
  platforms missing a home channel still appear, annotated "set a home
  channel first" (option B), so the user knows what to fix. Edit modal
  preserves a job's current target even if it's no longer configured.
  Local-only state shows a "configure a platform under Channels" hint.

Validation: scheduler + endpoint E2E'd with a Matrix gateway (home set
and unset); 5 new tests; tests/cron + tests/hermes_cli/test_web_server
green (366 passed).
2026-06-05 20:23:54 -07:00
Teknium
9fbfeb31b9 fix(cron): make sequential jobs non-blocking too + sweep MCP after jobs finish
Follow-up on the parallel-dispatch decoupling: the sequential pass for
workdir/profile jobs still ran inline in the ticker thread, so a long
workdir/profile job reintroduced the exact starvation #37312 describes,
just for env-mutating jobs. And the MCP orphan sweep ran immediately
after dispatch in sync=False mode — before jobs finished — defeating its
own 'runs after every job' contract and racing jobs still spawning MCP
children.

- Sequential jobs now queue to a persistent single-thread cron-seq pool
  (preserves one-at-a-time ordering across ticks, never blocks the tick).
- Same in-flight dedup guard now covers sequential jobs.
- MCP orphan sweep runs via a done-callback after the LAST dispatched job
  completes in async mode; inline after as_completed in sync mode.

Verified E2E: tick(sync=False) returns in ~1ms with a 1.5s sequential job
in flight; sweep fires only after that job ends.
2026-06-04 05:40:13 -07:00
Vynxe Vainglory
eb9cde7346 fix(cron): decouple job dispatch from completion in tick()
PR #13021 fixed serial starvation by adding ThreadPoolExecutor to tick(),
but kept as_completed(timeout=600) which still blocks the ticker thread
until the slowest job finishes. This causes the same starvation pattern:
when one job runs long (15+ min), other jobs' next_run_at expires past the
grace window and they get perpetually fast-forwarded instead of running.

This PR decouples dispatch from completion:
- Persistent ThreadPoolExecutor (reused across ticks, no auto-join)
- Fire-and-forget dispatch: tick submits and returns immediately
- Running-job guard: prevents re-dispatching active jobs
- sync parameter: defaults to True (backward compatible), callers opt
  into sync=False for non-blocking behavior
- atexit shutdown handler for clean pool teardown
- gateway/run.py: production ticker opts into sync=False

Refs #33315 (complementary — that issue's PRs fix grace handling in
jobs.py; this PR prevents the grace from expiring in the first place)
2026-06-04 05:40:13 -07:00
helix4u
ffb53767bf fix(config): align prefill messages key handling 2026-06-03 23:51:44 -06:00
Vinoth
ae5b2de2fa fix: expand skill bundles in cron jobs 2026-06-02 18:39:28 -07:00
Teknium
2c0d648397
fix(cron): sanitize invisible unicode in vetted skill content instead of hard-blocking (#37245)
A stray zero-width space (U+200B), BOM, or bidi control in loaded skill
markdown permanently killed any cron that loaded it. The skills-attached
assembled-prompt scan hard-blocked on any invisible-unicode char, even
though skill bodies are already install-time vetted by skills_guard.py and
the chars commonly appear in copy-pasted unicode docs / code examples.

The skills path now strips invisibles (logging the codepoints) and runs the
cleaned prompt. The raw user-prompt path (_scan_cron_prompt) keeps the hard
block — that is the actual #3968 injection surface, where a small directive
prompt with a ZWSP is a smoking gun, not prose. Stripping does not let a real
injection slip through: the directive still matches after sanitization.

_scan_cron_skill_assembled now returns (cleaned_prompt, error).
2026-06-02 00:29:44 -07:00
Teknium
ccd899318e
fix(cron): split scanner into two tiers so skill prose stops false-positiving (#32339)
The runtime cron prompt scanner (added in #3968 to plug the
"malicious skill carrying an injection payload" gap) reuses the same
critical-severity patterns as the create-time user-prompt scan against
the *assembled* prompt — which includes loaded skill markdown.

That works fine for narrow patterns like "ignore previous instructions"
which never legitimately appear in prose. It catastrophically false-
positives on command-shape patterns like `cat ~/.hermes/.env`,
`authorized_keys`, `/etc/sudoers`, and `rm -rf /`, which routinely
appear in security postmortems and runbooks as **descriptive prose**
about attacks, not as actual commands.

Concrete failure: the bundled `hermes-agent-dev` skill contains a
security postmortem section saying "the attacker could just
`cat ~/.hermes/.env`". Every PR-scout cron job that loaded this skill
was silently blocked with `Blocked: prompt matches threat pattern
'read_secrets'`. All 11 scout jobs failed for weeks.

Fix: split the scanner into two tiers and route by context:

  - `_scan_cron_prompt` (strict, unchanged behavior) runs against
    the small user-authored cron prompt at create/update and as a
    runtime defense-in-depth when no skills are attached. A legit
    user prompt has no business saying `cat .env`, so the strict
    patterns still apply there.

  - `_scan_cron_skill_assembled` (new, looser) runs against the
    assembled prompt when skills are attached. It only catches
    unambiguous prompt-injection directives ("ignore previous
    instructions", "disregard your rules", "system prompt override",
    "do not tell the user") plus invisible-unicode markers. Command-
    shape patterns are dropped because they false-positive on prose.

This is defense-in-depth, not the only line of defense. Skill bodies
are already scanned at install time by `skills_guard.py`; the runtime
cron scan exists purely as a tripwire for an obvious injection
directive surviving a malicious install. Catching prose mentions of
commands was never the goal of #3968 — the test that planted a skill
containing `cat ~/.hermes/.env` was the wrong shape of test for the
threat model.

Tests:
- `_scan_cron_prompt` strict behavior preserved (56 existing tests
  unchanged: bare `cat .env`, `rm -rf /`, etc. still block).
- New `TestScanCronSkillAssembled` class verifies the looser scanner:
  injection / disregard / system-override / do-not-tell-the-user /
  invisible-unicode still block; descriptive prose about attack
  commands is allowed; GitHub auth-header allowlist still works.
- `test_skill_with_env_exfil_payload_raises` (planted `cat .env`
  in skill body) replaced with `test_skill_with_env_exfil_command
  _in_prose_is_allowed` documenting the new correct behavior with
  the real-world postmortem-style example that triggered the bug.
- All 11 originally-failing PR-scout jobs validated end-to-end via
  `_build_job_prompt` — assembled prompts now build successfully
  with the `hermes-agent-dev` skill attached.

Total: 75/75 tests in cron + cronjob_tools + threat scanner pass;
544/544 across the wider cron / memory / threat-pattern surface.
2026-05-25 18:20:45 -07:00
Glen Workman
d952b377aa
fix: add cron API provenance logging (#24889)
Co-authored-by: sgtworkman <178342791+sgtworkman@users.noreply.github.com>
2026-05-25 01:15:56 -07:00
Schrotti77
9863a07af6 fix(cron): layer agent.disabled_toolsets onto cron baseline (#25752)
The bug: cron/scheduler.py:_resolve_cron_enabled_toolsets returns an
LLM-supplied per-job enabled_toolsets verbatim. The disabled_toolsets
passed to AIAgent was a hardcoded [cronjob, messaging, clarify] that
ignored agent.disabled_toolsets from config.yaml. An LLM could call
cronjob(action='add', enabled_toolsets=['terminal','file'],
prompt='...') and the cron-spawned agent would receive terminal+file
even when the operator had globally disabled them.

Fix: new _resolve_cron_disabled_toolsets() helper that ALWAYS layers
agent.disabled_toolsets on top of the cron baseline. AIAgent's
disabled_toolsets takes precedence over enabled_toolsets, so this
stops the bypass regardless of what the per-job override contains.

This is the disabled-side fix. Three concurrent PRs (#25842, #25815,
#25780) proposed intersection-side variants on _resolve_cron_enabled_toolsets;
this fix is more robust because it stops the leak at the precedence
boundary AIAgent itself enforces, not at a layer above.

Regression test reproduces the issue's PoC exactly:
config.yaml has agent.disabled_toolsets=[terminal,file]; cron job has
enabled_toolsets=[web,terminal,file]; assertion: AIAgent receives
disabled_toolsets containing terminal AND file.

Salvaged from PR #25786 by @Schrotti77. Simplified the implementation:
dropped a 23-line _normalize_toolset_list() helper (handled str/tuple/
set/garbage input shapes) in favor of the existing convention
(agent_cfg.get('disabled_toolsets') or []) used elsewhere in the
codebase. YAML always parses these as lists; the elaborate normalizer
was theatre for shapes we never produce.

Closes #25752

Co-authored-by: teknium1 <127238744+teknium1@users.noreply.github.com>
2026-05-25 01:09:54 -07:00
Teknium
6a8e131a0a refactor(ntfy): convert built-in adapter to platform plugin
ntfy now ships as a self-contained plugin under plugins/platforms/ntfy/
instead of editing 8 core files (gateway/config.py Platform enum,
gateway/run.py factory + auth maps, cron/scheduler.py, toolsets.py,
hermes_cli/status.py, agent/prompt_builder.py, gateway/channel_directory.py,
tools/send_message_tool.py).

All routing goes through gateway/platform_registry via register_platform():
- adapter_factory, check_fn, validate_config, is_connected
- env_enablement_fn seeds PlatformConfig.extra from NTFY_* env vars so
  gateway status reflects env-only setups without instantiating httpx
- standalone_sender_fn handles deliver=ntfy cron jobs when cron runs
  out-of-process from the gateway
- allowed_users_env / allow_all_env hook into _is_user_authorized
- cron_deliver_env_var=NTFY_HOME_CHANNEL for cron home routing
- platform_hint surfaces in the system prompt
- pii_safe=True (topic names are the only identifier; no PII to redact)

Tests moved to tests/gateway/test_ntfy_plugin.py using _plugin_adapter_loader
so the module lives under plugin_adapter_ntfy in sys.modules and cannot
collide with sibling plugin-adapter tests on the same xdist worker. The
core-file grep tests (Platform.NTFY in source, hermes-ntfy in toolsets,
etc.) are replaced with plugin-shape tests covering register() metadata,
env_enablement_fn output, and standalone_sender_fn behavior.

68 tests pass under scripts/run_tests.sh.
2026-05-23 16:13:01 -07:00
sprmn24
b10f17bf1e feat(ntfy): add ntfy platform adapter with atomic reconnect, identity fix, and 81 tests 2026-05-23 16:13:01 -07:00
Eugeniusz Gilewski
41d2c758c3 Fix unsafe gateway media path delivery 2026-05-23 01:40:35 -07:00
emozilla
984e6cb5b8 feat(whatsapp): add WhatsApp Business Cloud API adapter
Add an official, production-grade WhatsApp integration via Meta's
Business Cloud API as a complement to the existing Baileys bridge.
No bridge subprocess, no QR codes, no account-ban risk — at the cost
of a Meta Business account and a public HTTPS webhook URL.

Setup is fully wizard-driven: 'hermes whatsapp-cloud' walks through
every credential with paste-time validation (catches the #1 trap of
pasting a phone number into the Phone Number ID field), generates a
verify token, and ends with copy-paste instructions for the
cloudflared / Meta-dashboard / Business Manager pieces that can't be
automated. The wizard also points users at Meta's Business Manager
for setting the bot's display name and profile picture.

Feature set:

- Inbound: text, images (with native-vision routing), voice notes
  (STT), documents (small text inlined, larger cached), reply context.
- Outbound: text with WhatsApp-flavored markdown conversion, images,
  videos, documents, opus voice notes via ffmpeg with MP3 fallback.
- Native interactive buttons for clarify, dangerous-command approval,
  and slash-command confirmation flows — matches the Telegram /
  Discord UX, graceful degrades to plain text.
- Read receipts (blue double-checkmarks) and typing indicator,
  using Meta's combined endpoint so they fire in a single API call.
- Webhook security: X-Hub-Signature-256 HMAC verification (raw body,
  constant-time), wamid deduplication, group-shaped-message refusal
  (groups deferred to v2 — Baileys still covers them).
- Full integration with the gateway's session, cron, display-tier,
  prompt-hint, and auth-allowlist systems. Cloud and Baileys can run
  side-by-side against different phone numbers.

Also wires STT (speech-to-text) through Nous's managed audio gateway
for Nous subscribers — previously the default stt.provider=local
required a separate faster-whisper install. New subscribers now get
voice-note transcription out of the box.

Docs: 418-line user guide at website/docs/user-guide/messaging/
whatsapp-cloud.md, sidebar entry, environment-variables reference,
ADDING_A_PLATFORM.md updated with the optional interactive-UX
contract for future adapter authors.

Tests: 100 dedicated tests for the adapter, 32 for the setup wizard,
20 for the Nous subscription STT wiring, plus regression coverage
across display_config, prompt_builder, and the cron scheduler.

Known limitations (deferred until clear demand signal):
- Group chats — use the Baileys bridge if you need them.
- Message templates for 24-hour-window outside-conversation sends —
  reactive chat is unaffected; cron / delegate_task with gaps > 24h
  will fail with a clear error. The agent's system prompt warns the
  model about this so it knows to mention it when scheduling delayed
  messages.
2026-05-23 01:07:01 -04:00
nekwo
f007ef8ab5 fix(windows): hide cron script subprocess consoles
Apply CREATE_NO_WINDOW flags when the cron scheduler launches job scripts on Windows so gateway-managed no-agent cron jobs do not flash cmd or python console windows every tick.
2026-05-19 11:23:15 -07:00
vanthinh6886
62573f44cf fix: guard yaml.safe_load, flock unlock, TOCTOU races, and atomic writes
1. trajectory_compressor.py: yaml.safe_load() returns None on empty
   files, crashing with TypeError on `if 'tokenizer' in data`. Fix by
   adding `or {}` fallback. (HIGH — blocks startup with empty config)

2. 6 files with fcntl.flock(LOCK_UN) in finally blocks without
   try/except: cron/scheduler.py, hermes_cli/auth.py,
   agent/shell_hooks.py, tools/skill_usage.py,
   tools/environments/file_sync.py, tools/memory_tool.py. If unlock
   raises OSError, fd.close() is skipped and the lock is held forever.
   The msvcrt branches already had try/except; the fcntl branches did
   not. Fix by wrapping in try/except (OSError, IOError): pass.

3. agent/copilot_acp_client.py line 639: TOCTOU race — path.exists()
   followed by path.read_text() with no try/except. If file is deleted
   between the check and the read, FileNotFoundError propagates. Fix
   by using try/except FileNotFoundError.

4. gateway/sticker_cache.py: non-atomic write via Path.write_text()
   can leave truncated JSON on crash, causing JSONDecodeError on next
   load. Fix by writing to tempfile + fsync + os.replace (atomic).
2026-05-19 00:12:41 -07:00
analista
d81b888807 fix(telegram): report cron topic fallback 2026-05-18 22:45:05 -07:00
konsisumer
a4fb0a3ac3 fix(cron): route Telegram cron deliveries to a dedicated topic via TELEGRAM_CRON_THREAD_ID
When Telegram topic mode is enabled, cron messages delivered to the bot's
root DM (TELEGRAM_HOME_CHANNEL without a thread id) land in the system
lobby — replies there are rebuffed with the lobby reminder and
reply_to_message_id is dropped, so users cannot interact with the cron
output (#24409).

Add an optional TELEGRAM_CRON_THREAD_ID env var that overrides
TELEGRAM_HOME_CHANNEL_THREAD_ID for cron deliveries only. Operators can
create a "Cron" forum topic in the DM, point this var at its thread id,
and replies to cron messages will land in that topic's existing session
instead of the lobby. The home-channel thread id (used elsewhere, e.g.
restart notifications) is unchanged, and explicit
deliver="telegram:chat:thread" targets continue to win over the env var.

Per the reporter's clarification on 2026-05-13, option (a) (cron-side
route to a dedicated topic + config knob) was chosen.

Fixes #24409
2026-05-18 22:36:11 -07:00
joe102084
6143013f5b fix: handle whitespace-only cron responses 2026-05-18 20:08:11 -07:00
vanthinh6886
2b538c1f4e fix: guard json.loads() against invalid TTS and skill_view responses
Two code paths call json.loads() on output from external tools without
catching JSONDecodeError. If the tool returns a non-JSON string (error
message, empty string, or None), the entire call path crashes.

1. gateway/run.py — text_to_speech_tool() result in voice reply path.
   A TTS failure that returns an error string instead of JSON crashes
   the voice reply handler, killing the message response entirely.

2. cron/scheduler.py — skill_view() result when loading skills for
   cron jobs. A corrupted or missing skill file that returns an error
   string instead of JSON crashes the cron tick, preventing all jobs
   from executing that cycle.

Both fixes catch (json.JSONDecodeError, TypeError), log a warning,
and gracefully skip the failed operation instead of crashing.
2026-05-18 20:03:44 -07:00
alt-glitch
ef5fe8dfaf fix(cron): gracefully degrade when runtime profile is deleted
Instead of raising FileNotFoundError (which silently bricks the job),
log a warning and fall back to the scheduler default home. Validates
at create/update time still catches typos. Idea from PR #19958.
2026-05-18 17:39:50 +00:00
alt-glitch
1d74d7f73a fix(cron): use delta-based env restore instead of clear+update
Avoids a brief window where other threads see an empty os.environ
during profile job teardown. Idea from PR #19958.
2026-05-18 17:39:50 +00:00
Gianfranco Piana
9c48d47aaf fix(cron): isolate profile job env 2026-05-18 17:39:50 +00:00
Gianfranco Piana
544406ef23 fix: avoid process-wide cron profile home mutation 2026-05-18 17:39:50 +00:00
Gianfranco Piana
bb9ecb2178 feat: add cron job profile support 2026-05-18 17:39:50 +00:00
pr7426
7a7e78a360 fix(cron): prevent parallel job result loss on exception
Replace generator-based result collection with explicit per-future
handling. Each future is now processed independently with a 600s timeout.

Before: _results.extend(f.result() for f in _futures)
- One exception stops the generator, remaining results are lost
- No timeout: one hung job blocks the entire tick

After: as_completed() + per-future try/except
- Each future handled independently
- 600s timeout prevents indefinite blocking
- Failed futures are logged and counted as failures
2026-05-16 23:05:27 -07:00