docs(kanban): worker lane contract page + review-required convention

Closes the architectural-pin part of #19931. Most of what that issue asked for is already implemented (logs under kanban root, env-pinned workspace, dispatcher routing of unknown assignees, lifecycle ownership, structured handoff conventions). What was missing: 1. A written contract integrators can point at when adding a new worker lane shape, and 2. The "code-changing workers should not auto-promote success to done" convention. This commit ships both as docs+convention layered on existing primitives. No kernel changes — the kanban_complete / kanban_block / kanban_comment surfaces already support the review-required pattern; we just hadn't written it down or made it visible to workers. Changes: - `agent/prompt_builder.py::KANBAN_GUIDANCE`: append the review-required exception to step 5 of the lifecycle. Workers get the cue auto-injected into their system prompt — drop structured metadata into a kanban_comment first, then end with kanban_block(reason="review-required: <summary>") instead of kanban_complete when the work needs review. Total prompt size went from ~3000 to ~3275 chars; well under the 4096 budget enforced by test_kanban_guidance_size. - `skills/devops/kanban-worker/SKILL.md`: add a worked example to the existing "Good summary + metadata shapes" section between the Coding-task and Research-task examples. Same shape as the others (kanban_comment with structured handoff JSON, then kanban_block with the human-readable reason). Plus a one-line guide on when to use kanban_complete vs the review-required pattern. - `website/docs/user-guide/features/kanban-worker-lanes.md` (new): the integrator-facing contract. Covers the hierarchy, the three things every lane must provide (assignee, spawn mechanism, lifecycle terminator), the env vars the dispatcher injects, the review-required convention, the failure modes the kernel handles for free, and an explicit "external CLI worker lane" deferred- pending-concrete-asker section that links to #19931 and #19924. - `website/sidebars.ts`: link the new page under user-guide/features. The "specialist worker lanes for external CLI tools (Codex / Claude Code / OpenCode)" runner is NOT shipped here. The dispatcher's spawn_fn parameter already supports plugin-shaped extension; the per-CLI integration work (auth, sandbox policy, exit-code mapping) needs a concrete asker. The new docs page tells would-be integrators the contract any such lane must satisfy. Refs #19931
2026-05-18 04:41:56 +00:00 · 2026-05-10 18:15:08 -07:00 · 2026-05-10 18:15:08 -07:00 · ae83a54be4
commit ae83a54be4
parent 666b751536
4 changed files with 146 additions and 1 deletions
--- a/website/docs/user-guide/features/kanban-worker-lanes.md
+++ b/website/docs/user-guide/features/kanban-worker-lanes.md
@ -0,0 +1,113 @@
+# Kanban worker lanes
+
+A **worker lane** is a class of process that the kanban dispatcher can route tasks to. Each lane has an identity (the assignee string), a spawn mechanism, and a contract for what it must do with the task once spawned.
+
+This page is the contract. It exists for two audiences:
+
+- **Operators** picking which lanes to wire into a board (which profiles to create, which assignees to use).
+- **Plugin / integration authors** wanting to add a new lane shape (a CLI worker that wraps Codex / Claude Code / OpenCode, a containerised review worker, a non-Hermes service that pulls tasks via the API).
+
+If you're writing the worker code itself — the agent that runs *inside* a lane — the [`kanban-worker`](https://github.com/NousResearch/hermes-agent/blob/main/skills/devops/kanban-worker/SKILL.md) skill is the deeper procedural detail.
+
+## The hierarchy
+
+```text
+Hermes Kanban  =  canonical task lifecycle + audit trail
+Worker lane    =  implementation executor for one assigned card
+Reviewer       =  human or human-proxy that gates "done"
+GitHub PR      =  upstreamable artifact (optional, for code lanes)
+```
+
+Hermes Kanban owns lifecycle truth — `ready` → `running` → `blocked` / `done` / `archived`. Worker lanes execute work but never own that truth; everything they do flows back through the kanban kernel via the `kanban_*` tools (or, for non-Hermes external workers, via the API). Reviewers gate the transition from "code change written" to "task done."
+
+## What a lane provides
+
+To be a kanban worker lane, an integration must provide three things:
+
+### 1. An assignee string
+
+The dispatcher matches `task.assignee` against either a Hermes profile name (the default lane shape) or a registered non-spawnable identifier (the plugin lane shape — see [Adding an external CLI worker lane](#adding-an-external-cli-worker-lane) below). Tasks whose assignee doesn't resolve are left on `ready` with a `skipped_nonspawnable` event so a board operator can fix them; they are not silently dropped or executed by an arbitrary fallback.
+
+### 2. A spawn mechanism
+
+For Hermes profile lanes, the dispatcher's `_default_spawn` runs `hermes -p <assignee> chat -q <prompt>` (or the equivalent module form when the `hermes` shim isn't on `$PATH`) inside the task's pinned workspace, with these env vars set:
+
+| Variable | Carries |
+|---|---|
+| `HERMES_KANBAN_TASK` | the task id the worker is operating on |
+| `HERMES_KANBAN_DB` | absolute path to the per-board SQLite file |
+| `HERMES_KANBAN_BOARD` | board slug |
+| `HERMES_KANBAN_WORKSPACES_ROOT` | root of the board's workspace tree |
+| `HERMES_KANBAN_WORKSPACE` | absolute path to *this* task's workspace |
+| `HERMES_KANBAN_RUN_ID` | the current run's id (for the lifecycle gate) |
+| `HERMES_KANBAN_CLAIM_LOCK` | the claim lock string (`<host>:<pid>:<uuid>`) |
+| `HERMES_PROFILE` | the worker's own profile name (for `kanban_comment` author attribution) |
+| `HERMES_TENANT` | tenant namespace, if the task has one |
+
+For non-Hermes lanes (registered via a plugin), the plugin supplies its own `spawn_fn` callable that gets `task`, `workspace`, and `board` and returns an optional pid for crash detection.
+
+### 3. A lifecycle terminator
+
+Every claim must end in exactly one of:
+
+- `kanban_complete(summary=..., metadata=...)` — task succeeds, status flips to `done`.
+- `kanban_block(reason=...)` — task waits for human input, status flips to `blocked`. The dispatcher respawns when `kanban_unblock` runs.
+- The worker process exits without a tool call. The kernel reaps it and emits `crashed` (PID died) or `gave_up` (consecutive-failure breaker tripped) or `timed_out` (max_runtime exceeded). This is the failure path; healthy workers don't end here.
+
+The kanban kernel enforces that exactly one of these terminates each run. A worker that calls neither and exits normally is treated as crashed.
+
+## Outputs and the review-required convention
+
+For most code-changing tasks, the work isn't truly *done* the moment the worker finishes — it needs a human reviewer. The kanban kernel doesn't enforce this distinction (a "code-changing task" is fuzzy and forcing block-instead-of-complete on every code worker would break flows where no review is wanted). It's a convention layered on top:
+
+- **Block instead of complete**, with `reason` prefixed `review-required: ` so the dashboard / `hermes kanban show` surfaces the row as awaiting review.
+- **Drop structured metadata into a `kanban_comment` first** since `kanban_block` only carries the human-readable `reason`. Comments are the durable annotation channel — every audit-relevant field (changed_files, tests_run, diff_path or PR url, decisions) belongs there.
+- **Reviewer either approves and unblocks**, which respawns the worker with the comment thread for follow-ups; or asks for changes via another comment, which the next worker run sees as part of `kanban_show`'s context.
+
+The [`kanban-worker`](https://github.com/NousResearch/hermes-agent/blob/main/skills/devops/kanban-worker/SKILL.md) skill has worked examples for both `kanban_complete` (truly terminal tasks — typo fixes, docs changes, research writeups) and the `review-required` block pattern.
+
+## Logs and audit trail
+
+The dispatcher writes per-task worker stdout/stderr to `<board-root>/logs/<task_id>.log`. Logs are auditable from kanban metadata:
+
+- `task_runs` rows carry the `log_path`, exit code (where available), summary, and metadata.
+- `task_events` rows carry every state transition (`promoted`, `claimed`, `heartbeat`, `completed`, `blocked`, `gave_up`, `crashed`, `timed_out`, `reclaimed`, `claim_extended`).
+- `kanban_show` returns both, so a reviewer (or a follow-up worker) reading the task gets the full history without needing dashboard access.
+
+The dashboard renders run history with summaries, metadata blocks, and exit-status badges. CLI users can run `hermes kanban tail <task_id>` to follow live, or `hermes kanban runs <task_id>` for the historical attempt list.
+
+## Existing lane shapes
+
+### Hermes profile lane (default)
+
+The shape every kanban worker takes today: the assignee is a profile name, the dispatcher spawns `hermes -p <profile>`, the worker auto-loads the [`kanban-worker`](https://github.com/NousResearch/hermes-agent/blob/main/skills/devops/kanban-worker/SKILL.md) skill plus the `KANBAN_GUIDANCE` system-prompt block, and uses the `kanban_*` tools to terminate the run. No setup beyond defining the profile.
+
+When you create profiles for your fleet, choose names that match the *role* you want the orchestrator to route to. The orchestrator (when there is one) discovers your profile names via `hermes profile list` — there's no fixed roster the system assumes (see the [`kanban-orchestrator`](https://github.com/NousResearch/hermes-agent/blob/main/skills/devops/kanban-orchestrator/SKILL.md) skill for the orchestrator side of the contract).
+
+### Orchestrator profile lane
+
+A specialisation of the profile lane: an orchestrator is a Hermes profile whose toolset includes `kanban` but excludes `terminal` / `file` / `code` / `web` for implementation. Its job is decomposing a high-level goal into child tasks via `kanban_create` + `kanban_link` and stepping back. The orchestrator skill encodes the anti-temptation rules.
+
+## Adding an external CLI worker lane
+
+Wiring a non-Hermes CLI tool (Codex CLI, Claude Code CLI, OpenCode CLI, a local coding-model runner, etc.) as a kanban worker lane is *not yet a paved path*. The dispatcher's spawn function is pluggable (`spawn_fn` is a parameter on `dispatch_once`), and a plugin could register its own `spawn_fn` for a non-Hermes assignee, but the surrounding integration work — wrapping the CLI's exit code into `kanban_complete` / `kanban_block` calls, mapping the CLI's workspace/sandbox conventions onto the dispatcher's `HERMES_KANBAN_WORKSPACE` env, handling auth and per-CLI policy — is still per-integration design work.
+
+If you're considering adding a CLI lane, open an issue describing the specific CLI and the workflow you're trying to enable. The contract above is the constraints any such lane must satisfy; the implementation shape (one plugin per CLI vs a generic CLI-runner plugin parameterised by config) is open.
+
+The historical issue for this is [#19931](https://github.com/NousResearch/hermes-agent/issues/19931) and the closed-not-merged Codex-specific PR [#19924](https://github.com/NousResearch/hermes-agent/pull/19924) — those describe the original architecture proposal but didn't land a runner.
+
+## Failure modes the dispatcher handles
+
+So lane authors don't have to reimplement these:
+
+- **Stale claim TTL** — a worker that claims and then never heartbeats / completes / blocks gets reclaimed after `DEFAULT_CLAIM_TTL_SECONDS` (15 min default) — but only if the worker process has actually died. A live worker (slow model spending 20+ min in one tool-free LLM call) gets the claim *extended* instead of killed; only a dead PID is reclaimed.
+- **Crashed worker** — a worker whose host-local PID has vanished is detected by `detect_crashed_workers` and reaped; the task increments `consecutive_failures` and may auto-block when the breaker trips.
+- **Run-level retry** — when a task is retried (post-block, post-crash, post-reclaim), the worker can use the `expected_run_id` parameter on terminating tools to fail fast if its own run was already superseded.
+- **Per-task max runtime** — `task.max_runtime_seconds` hard-caps wall-clock time per run, regardless of PID liveness. Catches genuinely-deadlocked workers that the live-PID extension would otherwise keep running.
+
+## Related
+
+- [Kanban overview](./kanban) — the user-facing intro.
+- [Kanban tutorial](./kanban-tutorial) — walkthrough with the dashboard open.
+- [`kanban-worker`](https://github.com/NousResearch/hermes-agent/blob/main/skills/devops/kanban-worker/SKILL.md) — the skill the worker process loads.
+- [`kanban-orchestrator`](https://github.com/NousResearch/hermes-agent/blob/main/skills/devops/kanban-orchestrator/SKILL.md) — the orchestrator side.
--- a/website/sidebars.ts
+++ b/website/sidebars.ts
@ -68,6 +68,7 @@ const sidebars: SidebarsConfig = {
            'user-guide/features/delegation',
            'user-guide/features/kanban',
            'user-guide/features/kanban-tutorial',
+            'user-guide/features/kanban-worker-lanes',
            'user-guide/features/goals',
            'user-guide/features/code-execution',
            'user-guide/features/hooks',