feat(cron): wire on_jobs_changed, cron.chronos config, docs + agent↔NAS contract

Phase 4F (F.1 + F.2 + F.3, agent side). F.4 is the operator-run live smoke
(needs a NAS deployment); recorded in the PR, not code.

F.1: on_jobs_changed wiring
- cron/scheduler.py: _notify_provider_jobs_changed() resolves the active
  provider, calls on_jobs_changed(), and swallows errors. Lives in
  scheduler.py (not jobs.py) so the store stays free of provider imports
  (no import cycle).
- Wired at the consumer surfaces AFTER a successful mutation: the cronjob
  model tool (tools/cronjob_tools.py, create/update/remove/pause/resume),
  which the `hermes cron` CLI also routes through, and the REST handlers
  (gateway/platforms/api_server.py, same five). Built-in's no-op default =
  zero behavior change on the default path. Sleeping-agent direct jobs.json
  writes (no tool/CLI/REST) are covered by reconcile-on-wake in start().

F.2: config
cron.chronos.{portal_url,callback_url,expected_audience,nas_jwks_url}.
All non-secret; the agent holds no scheduler creds and the outbound
provision call reuses the existing Nous token (no token key). Additive
deep-merge key, no version literal.

F.3: docs
- docs/chronos-managed-cron-contract.md: authoritative agent↔NAS wire
  contract (the three agent-cron endpoints + inbound /api/cron/fire + the
  3-hop trust model + at-most-once/re-arm semantics). This is what the
  NAS-side agent builds against.
- cron-internals.md: "Managed cron (Chronos) for scale-to-zero" section.
- cli-commands.md: cron.provider accepts chronos + the cron.chronos.* keys.
- User docs name no scheduler vendor (QStash is a NAS-internal detail).

INVARIANT re-verified: zero qstash/upstash hits across plugins/cron,
gateway, hermes_cli, tools, website/docs (the one remaining repo hit is an
unrelated Context7 MCP comment in tools/mcp_tool.py).

Tests: test_jobs_changed_notify (5): notify calls provider hook, swallows
errors, built-in harmless, tool create/remove notify. Full cron + chronos +
webhook + config + api_server_jobs suites green (504 in the
cron+chronos+webhook run).
2026-06-21 10:22:18 +00:00 · 2026-06-18 15:11:32 +10:00 · 2026-06-18 15:11:32 +10:00 · b75757d4aa
commit b75757d4aa
parent 3fc7b624d8
8 changed files with 409 additions and 5 deletions
--- a/docs/chronos-managed-cron-contract.md
+++ b/docs/chronos-managed-cron-contract.md
@@ -0,0 +1,192 @@
+# Chronos managed-cron — agent ↔ NAS wire contract
+
+**Status:** authoritative wire spec for the Chronos cron provider.
+**Audience:** the NAS-side implementer of the `agent-cron` endpoints
+(`nous-account-service`) and anyone debugging the managed-cron path.
+
+Chronos lets a hosted Hermes gateway **scale to zero** while idle and still
+fire cron jobs. Instead of an in-process 60-second ticker, the agent asks NAS
+to arm exactly **one external one-shot per job at that job's real next-fire
+time**. NAS calls the agent back at fire time over an authenticated webhook;
+the agent runs the job and re-arms the next one-shot. Between fires the agent
+process can be fully stopped — it wakes only on a genuine fire.
+
+The external scheduler NAS uses to implement the one-shots is an **internal NAS
+implementation detail**. The agent never talks to it, never holds its
+credentials, and never names it. The agent only knows the three NAS endpoints
+below.
+
+```
+create/update/pause/resume/remove a cron job (agent side)
+  │
+  ▼
+ChronosCronScheduler.reconcile()        ── agent computes next_run_at
+  │  POST {portal}/api/agent-cron/provision   (auth: agent's Nous access token)
+  ▼
+NAS arms a one-shot for fire_at         ── NAS owns the scheduler + its creds
+  │
+  ⏰ at fire_at
+  ▼
+scheduler → POST {portal}/api/agent-cron/relay   (auth: scheduler signature, NAS-verified)
+  │
+  ▼
+NAS mints a short-lived agent-audience JWT (purpose=cron_fire)
+  │  POST {agent_callback_url}/api/cron/fire        (auth: that JWT)
+  ▼
+agent verifies the NAS JWT → store CAS claim → run_one_job → re-arm next one-shot
+```
+
+## Trust model (read this first)
+
+| Hop | Who calls whom | Auth mechanism | Verified by |
+|---|---|---|---|
+| 1 | agent → NAS (`provision`/`cancel`/`list`) | the agent's existing **Nous Portal access token** (Bearer) | NAS (its normal agent-token path) |
+| 2 | scheduler → NAS (`relay`) | the scheduler's request **signature** | NAS (the signature path it already has) |
+| 3 | NAS → agent (`/api/cron/fire`) | a **short-lived NAS-minted JWT** (`aud=agent:{instance_id}`, `purpose=cron_fire`) | agent (PyJWT against NAS JWKS) |
+
+Why NAS-mediated rather than scheduler→agent direct: the scheduler signs with
+**NAS's** keys, which the agent does not (and should not) hold. The agent can
+only verify a **NAS-minted** token — a trust path it already has. This keeps
+all scheduler credentials inside NAS. (Full rationale: the plan's DQ-4.)
+
+No new secret is introduced on the agent: hop 1 reuses the token the agent
+already uses for the portal, and hop 3 reuses the NAS-JWT verification the agent
+already performs.
+
+---
+
+## Endpoint 1 — `POST /api/agent-cron/provision`  (agent → NAS)
+
+Arm (or re-arm, idempotently) exactly one one-shot for a job.
+
+- **Auth:** `Authorization: Bearer <agent Nous access token>`. NAS validates via
+  its normal agent-token path and scopes the row to the calling agent/org.
+- **Request body:**
+  ```json
+  {
+    "job_id": "ab12cd34",
+    "fire_at": "2026-06-18T12:34:56+00:00",
+    "agent_callback_url": "https://agent-xyz.fly.dev",
+    "dedup_key": "ab12cd34:2026-06-18T12:34:56+00:00"
+  }
+  ```
+  - `fire_at` — ISO 8601, **agent-computed**. May be sub-minute in the future;
+    NAS must honor second-granularity (the agent owns the time, so there is no
+    1-minute scheduler floor).
+  - `agent_callback_url` — the agent's own publicly-reachable base URL. NAS
+    POSTs `{agent_callback_url}/api/cron/fire` at fire time.
+  - `dedup_key` — `"{job_id}:{fire_at}"`. NAS **upserts by `(agent_id, job_id)`**
+    so re-arming the same fire is idempotent (no duplicate one-shots). A new
+    `fire_at` for the same `job_id` replaces the prior arm.
+- **Action:** arm one one-shot to fire at `fire_at`, destined for the NAS
+  **relay** route (Endpoint 3) — NOT the agent directly, so NAS stays in the
+  loop to mint the agent JWT. Persist `(agent_id, job_id, schedule_id,
+  agent_callback_url)`.
+- **Response:** `200 {"schedule_id": "<opaque>"}`.
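
As a rough illustration of the request body described above, the sketch below builds a provision payload in Python. `build_provision_payload` is a hypothetical helper, not a function from the repo, and normalizing `fire_at` to UTC with microseconds dropped is an assumption made so the `dedup_key` stays stable across re-arms of the same fire.

```python
from datetime import datetime, timezone


def build_provision_payload(job_id: str, fire_at: datetime, callback_url: str) -> dict:
    """Build an Endpoint 1 request body (hypothetical helper, illustration only)."""
    if fire_at.tzinfo is None:
        raise ValueError("fire_at must be timezone-aware")
    # Assumption: normalize to UTC at second granularity so that the same
    # fire always serializes to the same string (and the same dedup_key).
    fire_at_iso = fire_at.astimezone(timezone.utc).replace(microsecond=0).isoformat()
    return {
        "job_id": job_id,
        "fire_at": fire_at_iso,
        "agent_callback_url": callback_url.rstrip("/"),
        "dedup_key": f"{job_id}:{fire_at_iso}",
    }


payload = build_provision_payload(
    "ab12cd34",
    datetime(2026, 6, 18, 12, 34, 56, tzinfo=timezone.utc),
    "https://agent-xyz.fly.dev",
)
```

With these inputs the payload matches the example body in the contract, including the `"{job_id}:{fire_at}"` shape of `dedup_key`.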
+
+## Endpoint 2 — `POST /api/agent-cron/cancel`  (agent → NAS)
+
+- **Auth:** same as Endpoint 1.
+- **Body:** `{"job_id": "ab12cd34"}`.
+- **Action:** cancel the armed one-shot for `(agent_id, job_id)` and delete the
+  row. Idempotent — cancelling an unknown job is a 200 no-op.
+- **Response:** `200 {"ok": true}`.
+
+## Endpoint 3 — `POST /api/agent-cron/relay`  (scheduler → NAS, the fire relay)
+
+- **Auth:** the scheduler's request **signature**, verified by NAS with the
+  signature path it already has. This is the trust boundary for the fire — a
+  forged relay call must be rejected here.
+- **Action:**
+  1. Look up `(agent_id, job_id) → agent_callback_url` from the persisted row.
+  2. Mint a **short-lived** JWT: `aud = "agent:{instance_id}"`,
+     `iss = {portal_url}`, `purpose = "cron_fire"`, small `exp` (≈60–120s),
+     signed with NAS's normal asymmetric signing key (published via JWKS).
+  3. `POST {agent_callback_url}/api/cron/fire` with
+     `Authorization: Bearer <that JWT>` and body `{"job_id": "...", "fire_at": "..."}`.
+  4. Treat a non-2xx agent response as a **retryable** failure (let the
+     scheduler retry the relay). The agent's store CAS de-dupes a double fire,
+     so retries are safe.
+- **Response to the scheduler:** 2xx once the agent POST is accepted (202), so
+  the scheduler does not retry a delivered fire.
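
The claim set from step 2 could look roughly like the following. This is only a sketch of the payload: `build_fire_claims` is a hypothetical name, the NAS implementation language is not stated in this contract, and including `iat`/`nbf` is an assumption (the contract only requires `aud`, `iss`, `purpose` and a small `exp`). Signing with the NAS asymmetric key is deliberately left out.

```python
import time
from typing import Optional


def build_fire_claims(portal_url: str, instance_id: str,
                      ttl_seconds: int = 90, now: Optional[float] = None) -> dict:
    """Assemble the cron_fire claim set (hypothetical helper; signing omitted)."""
    issued_at = int(time.time() if now is None else now)
    return {
        "iss": portal_url,                # agent compares this to cron.chronos.portal_url
        "aud": f"agent:{instance_id}",    # agent compares this to expected_audience
        "purpose": "cron_fire",           # rejects general-purpose agent JWTs
        "iat": issued_at,                 # assumption: not required by the contract
        "nbf": issued_at,                 # assumption: not required by the contract
        "exp": issued_at + ttl_seconds,   # short-lived, within the 60-120s range
    }


claims = build_fire_claims("https://portal.example", "xyz", ttl_seconds=90, now=1000)
```

The placeholder `https://portal.example` and instance id `xyz` are made up for the example.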
+
+---
+
+## Inbound `POST /api/cron/fire`  (NAS → agent) — agent side, already implemented
+
+This is the agent endpoint NAS calls in Endpoint 3 step 3. Implemented on the
+`APIServerAdapter` (`gateway/platforms/api_server.py`); the verifier is
+`plugins/cron/chronos/verify.py`.
+
+- **Auth:** `Authorization: Bearer <NAS-minted JWT>`. The agent verifies:
+  - signature against the NAS JWKS (`cron.chronos.nas_jwks_url`),
+  - `aud` == `cron.chronos.expected_audience` (this agent's
+    `agent:{instance_id}`),
+  - `iss` == `cron.chronos.portal_url`,
+  - `exp` / `nbf` (30s leeway),
+  - `purpose == "cron_fire"` — a general agent JWT (no/other purpose) is
+    rejected so it can't be replayed against this endpoint.
+- **Body:** `{"job_id": "ab12cd34", "fire_at": "..."}` (only `job_id` is used).
+- **Behavior:**
+  - invalid/missing/forged/expired/wrong-aud/wrong-purpose token → **401**, no
+    execution.
+  - missing `job_id` → **400**.
+  - valid → **202 `{"status": "accepted", "job_id": "..."}`** immediately, and
+    the job runs in the background. 202-before-run means a long agent turn never
+    trips the relay's HTTP timeout.
+- **At-most-once:** the agent claims the job with a store-level compare-and-set
+  (`claim_job_for_fire`) before running. A relay/scheduler retry that arrives
+  while the first fire is in flight (or after it completed) loses the claim and
+  does not double-run.
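
The status-code behavior above can be sketched as a pure function over already-decoded claims. Signature verification against the JWKS (done with PyJWT in the real verifier) is omitted here, and `check_fire_claims` / `fire_status` are hypothetical names; checking auth before the body is an assumption about ordering.

```python
import time
from typing import Optional


def check_fire_claims(claims: dict, expected_aud: str, expected_iss: str,
                      now: Optional[float] = None, leeway: int = 30) -> bool:
    """Claim checks applied after signature verification (not shown)."""
    now = time.time() if now is None else now
    if claims.get("aud") != expected_aud or claims.get("iss") != expected_iss:
        return False
    if claims.get("purpose") != "cron_fire":
        return False
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or now > exp + leeway:
        return False
    nbf = claims.get("nbf")
    if nbf is not None and (not isinstance(nbf, (int, float)) or now < nbf - leeway):
        return False
    return True


def fire_status(claims: Optional[dict], body: dict, expected_aud: str,
                expected_iss: str, now: Optional[float] = None) -> int:
    """Map a fire request to 401 / 400 / 202 as listed in the contract."""
    # claims is None when the token was missing or failed signature checks.
    if claims is None or not check_fire_claims(claims, expected_aud, expected_iss, now):
        return 401
    if not body.get("job_id"):
        return 400
    return 202  # accepted; the job itself would run in the background
```

A token with the wrong `purpose`, or one past `exp` plus the 30s leeway, ends up at 401 without the body ever being inspected.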
+
+---
+
+## At-most-once & re-arm semantics
+
+- **Recurring (cron/interval):** on fire, the agent advances `next_run_at`
+  (under its store lock) as part of the claim, runs the job, then re-provisions
+  a one-shot for the new `next_run_at`. A duplicate relay for the old `fire_at`
+  finds the claim taken / time advanced and is dropped.
+- **One-shot (`30m`, `+90s`, etc.):** fires once; `mark_job_run` marks it
+  completed. No re-arm.
+- **`repeat.times = N`:** `mark_job_run` deletes the job at the limit, so
+  `get_job` returns `None` after the final fire → the agent does **not** re-arm
+  → the schedule stops cleanly with no orphaned one-shot.
+- **Multi-replica agents:** the store CAS makes the fire at-most-once across N
+  gateway replicas sharing one `HERMES_HOME` — exactly one replica runs each
+  fire.
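
To make the compare-and-set idea concrete, here is a minimal in-memory sketch. It is not the real `claim_job_for_fire` (whose signature is not shown in this document): `InMemoryJobStore` and `try_claim` are hypothetical, and a thread lock stands in for whatever store lock the agent actually uses across replicas.

```python
import threading
from typing import Dict, Optional


class InMemoryJobStore:
    """Toy store: job_id -> next_run_at, guarded by one lock (illustration only)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._next_run_at: Dict[str, str] = {}

    def set_next_run(self, job_id: str, next_run_at: str) -> None:
        with self._lock:
            self._next_run_at[job_id] = next_run_at

    def next_run(self, job_id: str) -> Optional[str]:
        with self._lock:
            return self._next_run_at.get(job_id)

    def try_claim(self, job_id: str, expected_fire_at: str,
                  new_next_run_at: Optional[str]) -> bool:
        """Claim a fire only if next_run_at still equals expected_fire_at."""
        with self._lock:
            if self._next_run_at.get(job_id) != expected_fire_at:
                return False  # already claimed, time advanced, or job gone
            if new_next_run_at is None:
                del self._next_run_at[job_id]  # one-shot or repeat limit reached
            else:
                self._next_run_at[job_id] = new_next_run_at  # advance as part of the claim
            return True


store = InMemoryJobStore()
store.set_next_run("ab12cd34", "2026-06-18T12:00:00+00:00")
first = store.try_claim("ab12cd34", "2026-06-18T12:00:00+00:00", "2026-06-18T13:00:00+00:00")
retry = store.try_claim("ab12cd34", "2026-06-18T12:00:00+00:00", "2026-06-18T13:00:00+00:00")
```

Here `first` wins the claim and `retry`, standing in for a duplicate relay for the old `fire_at`, loses it because the time was already advanced.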
+
+## Reconcile (self-healing)
+
+The agent reconciles desired (`jobs.json`) vs armed on:
+- `start()` (gateway boot / wake),
+- every successful job mutation (`on_jobs_changed`),
+- piggybacked after each fire (re-arm).
+
+Reconcile arms missing/changed-time jobs and cancels orphans. A missed
+provision (transient NAS error) self-heals on the next reconcile. There is **no
+periodic wake** of a sleeping agent — that would negate scale-to-zero.
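
The desired-vs-armed comparison can be pictured as a small diff over two maps. This sketch assumes both sides are available as `job_id -> fire_at` dictionaries; how the agent obtains the armed view is not specified here, and `plan_reconcile` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def plan_reconcile(desired: Dict[str, str],
                   armed: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (job_ids to arm or re-arm, job_ids to cancel)."""
    # Arm jobs that are missing or whose fire time changed.
    to_arm = sorted(job_id for job_id, fire_at in desired.items()
                    if armed.get(job_id) != fire_at)
    # Cancel one-shots whose job no longer exists locally (orphans).
    to_cancel = sorted(job_id for job_id in armed if job_id not in desired)
    return to_arm, to_cancel


desired = {"a": "t1", "b": "t2", "c": "t3"}
armed = {"a": "t1", "b": "old", "d": "t9"}
to_arm, to_cancel = plan_reconcile(desired, armed)
```

In this example `a` is already correct, `b` changed time, `c` is missing and `d` is an orphan, so the plan arms `b` and `c` and cancels `d`.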
+
+## Config (agent side)
+
+All non-secret (`cron.chronos.*` in `config.yaml`); the agent holds no scheduler
+credentials. For hosted agents NAS sets these at provision time:
+
+| key | meaning |
+|---|---|
+| `cron.provider` | `"chronos"` to activate (empty = built-in ticker) |
+| `cron.chronos.portal_url` | NAS base URL (also the expected JWT `iss`) |
+| `cron.chronos.callback_url` | the agent's own public base URL for NAS→agent fires |
+| `cron.chronos.expected_audience` | this agent's JWT `aud` (`agent:{instance_id}`) |
+| `cron.chronos.nas_jwks_url` | NAS JWKS for verifying the fire JWT |
+
+If `callback_url` / `portal_url` is blank or the agent has no Nous login,
+`is_available()` returns False and the resolver falls back to the built-in
+in-process ticker — cron never loses its trigger.
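
The fallback rule above might be sketched as follows. The function names and the `"builtin"` label are hypothetical and the real resolver may differ; only the condition itself (blank `portal_url` or `callback_url`, or no Nous login, means fall back) comes from the contract.

```python
def chronos_is_available(chronos_cfg: dict, has_nous_login: bool) -> bool:
    """True only when both URLs are set and the agent has a Nous login."""
    portal = (chronos_cfg.get("portal_url") or "").strip()
    callback = (chronos_cfg.get("callback_url") or "").strip()
    return bool(portal and callback and has_nous_login)


def resolve_provider(config: dict, has_nous_login: bool) -> str:
    """Pick "chronos" when requested and usable, else the built-in ticker."""
    cron_cfg = config.get("cron") or {}
    wants_chronos = cron_cfg.get("provider") == "chronos"
    if wants_chronos and chronos_is_available(cron_cfg.get("chronos") or {}, has_nous_login):
        return "chronos"
    return "builtin"  # hypothetical label for the in-process ticker


config = {
    "cron": {
        "provider": "chronos",
        "chronos": {
            "portal_url": "https://portal.example",
            "callback_url": "https://agent-xyz.fly.dev",
        },
    }
}
```

With this config, `resolve_provider(config, has_nous_login=True)` selects Chronos, while a blank `callback_url` or a missing login falls back to the built-in ticker.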
+
+## Escape hatch (not default)
+
+The inbound `/api/cron/fire` verifier is pluggable (`get_fire_verifier()`). If
+relay volume through NAS ever saturates, a direct scheduler→agent mode with a
+per-job NAS-minted cron-key can replace the NAS-JWT verifier with **no change to
+the webhook handler**. NAS-mediated (this contract) is the default.