fix(kanban): respawn guard defers blocker_auth instead of auto-blocking (#28683)

Follow-up to #28455. The respawn guard's blocker_auth rule (last error
matched a quota/auth/429 pattern) was auto-blocking the task on first
occurrence. That's too aggressive: transient rate limits typically
clear in seconds to minutes, but the auto-block puts the task in
'blocked' status which requires manual unblock.

Now treats blocker_auth the same as recent_success and active_pr:
defer the spawn this tick, leave the task in 'ready', let the next
tick try again. If the auth error genuinely persists, the existing
consecutive_failures counter trips the auto-block circuit breaker
after failure_limit failures via the normal path — so a persistent
401/403/quota-exhausted still ends up blocked, just not on first hit.

Also documents the respawn_guarded event in kanban.md's events table
with the three guard reasons.

Updated test_dispatch_respawn_guard_auto_blocks_auth_error → renamed
to test_dispatch_respawn_guard_defers_auth_error_without_auto_block;
asserts task stays in 'ready' and the guard reason is recorded.
This commit is contained in:
Teknium 2026-05-19 03:27:45 -07:00 committed by GitHub
parent b10b783208
commit 7bcdced6c1
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
3 changed files with 53 additions and 27 deletions

View file

@ -834,6 +834,7 @@ Every transition appends a row to `task_events`. Each row carries an optional `r
| `crashed` | `{pid, claimer}` | Worker PID no longer alive but TTL hadn't expired yet. |
| `timed_out` | `{pid, elapsed_seconds, limit_seconds, sigkill}` | `max_runtime_seconds` exceeded; dispatcher SIGTERM'd (then SIGKILL'd after 5 s grace) and re-queued. |
| `stale` | `{elapsed_seconds, last_heartbeat_at, heartbeat_age_seconds, timeout_seconds, pid, terminated}` | Task ran longer than `kanban.dispatch_stale_timeout_seconds` (default 4 h) AND no `kanban_heartbeat` arrived in the last hour. Dispatcher SIGTERM'd the host-local worker (if any), reset the task to `ready` for re-dispatch. Does NOT tick the failure counter (stale is dispatcher-side absence detection, not a worker fault). Workers running long operations should call `kanban_heartbeat` at least once an hour to avoid this. |
| `respawn_guarded` | `{reason}` | Dispatcher refused to re-spawn this ready task this tick. Reasons: `blocker_auth` (last failure was a quota/auth/429 error — wait for the rate window to reset), `recent_success` (a completed run happened in the last hour — wait for review before re-running), `active_pr` (a GitHub PR URL appears in a recent comment — a prior worker already opened a PR). The task stays in `ready`; the next tick gets another chance to spawn. If the underlying condition persists, the normal `consecutive_failures` circuit breaker will auto-block via `gave_up` after `failure_limit` failures. |
| `spawn_failed` | `{error, failures}` | One spawn attempt failed (missing PATH, workspace unmountable, …). Counter increments; task returns to `ready` for retry. |
| `protocol_violation` | `{pid, claimer, exit_code}` | Worker exited successfully while the task was still `running`, usually because it answered without calling `kanban_complete` or `kanban_block`. The dispatcher also emits `gave_up` and auto-blocks immediately instead of retrying. |
| `gave_up` | `{failures, effective_limit, limit_source, error}` | Circuit breaker fired after N consecutive non-successful attempts. Task auto-blocks with the last error. The effective limit resolves as task `max_retries`, then dispatcher `failure_limit` / `kanban.failure_limit`, then the built-in default. |