fix(kanban): respawn guard defers blocker_auth instead of auto-blocking (#28683)

Follow-up to #28455. The respawn guard's blocker_auth rule (last error matched a quota/auth/429 pattern) was auto-blocking the task on first occurrence. That's too aggressive: transient rate limits typically clear in seconds to minutes, but the auto-block puts the task in 'blocked' status, which requires a manual unblock.

The guard now treats blocker_auth the same as recent_success and active_pr: defer the spawn this tick, leave the task in 'ready', and let the next tick try again. If the auth error genuinely persists, the existing consecutive_failures counter trips the auto-block circuit breaker after failure_limit failures via the normal path, so a persistent 401/403/quota-exhausted error still ends up blocked, just not on the first hit.

Also documents the respawn_guarded event in kanban.md's events table with the three guard reasons.

Renamed test_dispatch_respawn_guard_auto_blocks_auth_error to test_dispatch_respawn_guard_defers_auth_error_without_auto_block; it now asserts that the task stays in 'ready' and that the guard reason is recorded.
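
The defer-versus-block behaviour described above can be sketched roughly as follows. This is an illustrative model only, not the dispatcher's actual code: the `Task` class, `AUTH_PATTERN`, `respawn_guard_reason`, and `dispatch_tick` are hypothetical names, the error regex is a guess at what a "quota/auth/429 pattern" might look like, and the `gave_up` payload is reduced to a subset of the documented fields.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical pattern; the real matcher is not shown in this commit.
AUTH_PATTERN = re.compile(
    r"\b(401|403|429)\b|quota|rate limit|unauthorized", re.IGNORECASE
)


@dataclass
class Task:
    """Minimal stand-in for a kanban task row (assumed shape)."""
    status: str = "ready"
    last_error: Optional[str] = None
    consecutive_failures: int = 0
    events: list = field(default_factory=list)


def respawn_guard_reason(task: Task) -> Optional[str]:
    """Return a guard reason, or None if the task may be spawned.

    Only blocker_auth is modelled; recent_success and active_pr
    would be further checks of the same kind.
    """
    if task.last_error and AUTH_PATTERN.search(task.last_error):
        return "blocker_auth"
    return None


def dispatch_tick(task: Task, failure_limit: int = 3) -> str:
    """One dispatcher pass over a single task (assumed control flow)."""
    if task.status != "ready":
        return "skipped"

    # Normal circuit breaker: persistent failures still end up blocked.
    if task.consecutive_failures >= failure_limit:
        task.status = "blocked"
        task.events.append(("gave_up", {"failures": task.consecutive_failures}))
        return "blocked"

    # Guard hit: record the reason and defer, but leave the task 'ready'.
    reason = respawn_guard_reason(task)
    if reason is not None:
        task.events.append(("respawn_guarded", {"reason": reason}))
        return "deferred"

    return "spawned"
```

In this sketch a task whose last error was a 429 is deferred and stays in 'ready' with a `respawn_guarded` event recorded, while a task that has already reached `failure_limit` consecutive failures is blocked through the ordinary breaker path.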
2026-06-08 08:11:38 +00:00 · 2026-05-19 03:27:45 -07:00 · 2026-05-19 03:27:45 -07:00 · 7bcdced6c1
commit 7bcdced6c1
parent b10b783208
3 changed files with 53 additions and 27 deletions
--- a/website/docs/user-guide/features/kanban.md
+++ b/website/docs/user-guide/features/kanban.md
@@ -834,6 +834,7 @@ Every transition appends a row to `task_events`. Each row carries an optional `r
 | `crashed` | `{pid, claimer}` | Worker PID no longer alive but TTL hadn't expired yet. |
 | `timed_out` | `{pid, elapsed_seconds, limit_seconds, sigkill}` | `max_runtime_seconds` exceeded; dispatcher SIGTERM'd (then SIGKILL'd after 5 s grace) and re-queued. |
 | `stale` | `{elapsed_seconds, last_heartbeat_at, heartbeat_age_seconds, timeout_seconds, pid, terminated}` | Task ran longer than `kanban.dispatch_stale_timeout_seconds` (default 4 h) AND no `kanban_heartbeat` arrived in the last hour. Dispatcher SIGTERM'd the host-local worker (if any), reset the task to `ready` for re-dispatch. Does NOT tick the failure counter (stale is dispatcher-side absence detection, not a worker fault). Workers running long operations should call `kanban_heartbeat` at least once an hour to avoid this. |
+| `respawn_guarded` | `{reason}` | Dispatcher refused to re-spawn this ready task this tick. Reasons: `blocker_auth` (last failure was a quota/auth/429 error — wait for the rate window to reset), `recent_success` (a completed run happened in the last hour — wait for review before re-running), `active_pr` (a GitHub PR URL appears in a recent comment — a prior worker already opened a PR). The task stays in `ready`; the next tick gets another chance to spawn. If the underlying condition persists, the normal `consecutive_failures` circuit breaker will auto-block via `gave_up` after `failure_limit` failures. |
 | `spawn_failed` | `{error, failures}` | One spawn attempt failed (missing PATH, workspace unmountable, …). Counter increments; task returns to `ready` for retry. |
 | `protocol_violation` | `{pid, claimer, exit_code}` | Worker exited successfully while the task was still `running`, usually because it answered without calling `kanban_complete` or `kanban_block`. The dispatcher also emits `gave_up` and auto-blocks immediately instead of retrying. |
 | `gave_up` | `{failures, effective_limit, limit_source, error}` | Circuit breaker fired after N consecutive non-successful attempts. Task auto-blocks with the last error. The effective limit resolves as task `max_retries`, then dispatcher `failure_limit` / `kanban.failure_limit`, then the built-in default. |