hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-06-24 10:52:21 +00:00

Author	SHA1	Message	Date
Teknium	84e1d31e54	refactor(kanban): fold worker/orchestrator skills into injected guidance (#50473 ) The kanban-worker and kanban-orchestrator bundled skills existed only to be force-loaded into dispatcher-spawned workers, gated by environments:[kanban] so they wouldn't leak into normal CLI listings. That gating was fragile (the leak that #50443 patched) and the --skills auto-load was already best-effort — most workers ran without it because the bundled skill isn't present in profile-scoped skills dirs. Remove the skills entirely and promote their load-bearing content (workspace kinds, deliverable artifacts, created-card integrity, profile discovery) into KANBAN_GUIDANCE, which is already injected into every kanban worker's system prompt. Net result: every worker reliably gets the guidance, nothing can leak into a CLI/blank-slate session, and the gating machinery is gone. - agent/prompt_builder.py: promote the 4 load-bearing rules into KANBAN_GUIDANCE - hermes_cli/kanban_db.py: drop --skills kanban-worker auto-injection + _kanban_worker_skill_available probe - hermes_cli/kanban_swarm.py: drop skills=[kanban-orchestrator] on the root card - hermes_cli/kanban.py: drop kanban-init skill seeding; fix help text - delete skills/devops/kanban-{worker,orchestrator} - docs: delete the two skill pages (EN+zh), fix sidebars/catalog/kanban.md/kanban-worker-lanes.md and the video-orchestrator + codex-lane references - tests: update spawn-argv expectations; re-bound the guidance-size guard Supersedes the skill-leak half of #50443 (credit @helix4u for flagging the area).	2026-06-21 17:06:48 -07:00
Teknium	d164ed0326	fix(kanban): make reclaim claim-lock-aware to stop task/run status desync (#50366 ) After a worker crash + reclaim + respawn, the board could show a task in the Ready lane while its task_run was 'running' and the new worker was actively executing (#36910). The dispatcher could then treat live work as available and double-assign. Root cause: the three reclaim paths (detect_crashed_workers, release_stale_claims heartbeat-stale backstop, enforce_max_runtime) each snapshot a task's worker_pid/claim_lock, do liveness work, then reset tasks.status back to 'ready' with only a 'WHERE status=running' guard. If the task was reclaimed AND re-claimed by a NEW worker in between (new run, new claim_lock, live pid), the stale UPDATE clobbered the live task: status flipped to 'ready' while the fresh run stayed 'running'. claim_task is the only writer that sets status='running', so nothing put it back — permanent desync. Fix: gate each reset on the snapshot's claim_lock (and worker_pid where available) so it only fires when the task is still owned by the worker the reclaim was computed for. A stale reclaim now no-ops (rowcount 0) instead of desyncing a re-claimed task. Genuine crashes (lock still matches) reclaim exactly as before. This is the same race class the in-gateway dispatch lock (single-writer ticks) mitigates, closed at the row level so a single dispatcher's fast reclaim->respawn across two ticks is also safe. Closes #36910.	2026-06-21 12:49:07 -07:00
Teknium	84ba83b09a	fix(kanban): bound the cross-process init lock so connect() can't hang forever (#50353 ) connect() wrapped its entire body in an unbounded blocking flock(LOCK_EX) on every call (_cross_process_init_lock). A single process stalled inside the critical section — or a stale lock held by a wedged worker — blocked every other connect(), including the long-lived gateway dispatcher's next-tick connect, forever. No timeout, no traceback, no recovery: the board silently stopped being worked until a manual restart (issue #36644). Two fixes: 1. Fast-path skip: once THIS process has initialized a path, the expensive first-open work (header validation, integrity probe, schema + additive migrations) is already cached in _INITIALIZED_PATHS. The steady-state connect has nothing for the cross-process lock to protect, so it now opens the connection (WAL + pragmas) under only the cheap in-process _INIT_LOCK and never touches the file lock. This removes the lock from the dispatcher's hot path entirely — a stalled external 'hermes kanban list' can no longer block ticks. 2. Bounded acquire: even on first-init, _cross_process_init_lock now retries a non-blocking acquire up to a 10s deadline, then logs a WARNING and proceeds WITHOUT the cross-process lock. Safe because the in-process _INIT_LOCK still serializes same-process threads and the init work is idempotent (CREATE TABLE IF NOT EXISTS + additive migrations) — worst case is redundant work, not corruption. A bounded 'proceed anyway' beats an unbounded hang. Windows path switched LK_LOCK -> LK_NBLCK (non-blocking) to match. Closes #36644.	2026-06-21 12:43:41 -07:00
Teknium	9630ec6c19	fix(kanban): pin worker TERMINAL_CWD to the task workspace (#50348 ) _default_spawn launched the worker subprocess with cwd=workspace and set HERMES_KANBAN_WORKSPACE, but never set TERMINAL_CWD — so the worker inherited the dispatching gateway's TERMINAL_CWD. That value takes precedence over the process cwd in two places: - tools/file_tools.py::_resolve_base_dir — a relative write_file path resolved against the gateway user's home instead of the workspace, so artifacts silently landed outside the workspace (#41312). - agent_init's context-file loader — AGENTS.md was discovered relative to the gateway's cwd, so under multi-profile dispatch a worker loaded whichever gateway won the claim race's AGENTS.md, not the task's (#34619). Both are the same root cause. Pinning TERMINAL_CWD to the workspace (where the task's work actually happens) fixes both. Guarded on an existing absolute dir because file_tools rejects relative/sentinel TERMINAL_CWD values — a non-dir workspace leaves the inherited value rather than writing a meaningless one. Closes #34619, closes #41312.	2026-06-21 12:43:37 -07:00
Teknium	e217fd42e2	feat(kanban): add task lifecycle plugin hooks (claimed/completed/blocked) (#50349) Plugins could observe session/tool/approval lifecycle but had no way to observe kanban task transitions. Adds three observer hooks fired by the board's claim/complete/block transitions: - kanban_task_claimed (dispatcher process, before worker spawn) - kanban_task_completed (worker process, carries summary) - kanban_task_blocked (worker process, carries reason) Each fires AFTER the DB write txn commits, so a plugin observes durable state and a slow/hanging callback can never hold the SQLite write lock. All firing is best-effort: a raising hook is logged and swallowed and never breaks a board transition. profile_name is resolved from HERMES_HOME so dispatcher- and worker-side hooks carry the right profile. Requested by @Smithangshu on Discord.	2026-06-21 12:38:14 -07:00
Teknium	e581740aa1	fix(kanban): single-writer dispatch lock to prevent orphan-dispatcher DB corruption (#50331 ) A shell-launched 'hermes gateway run --replace' / 'gateway restart' on a systemd/launchd host can leave an orphan gateway whose kanban dispatcher escapes the service cgroup, survives 'systemctl restart', and becomes a second long-lived writer on the shared kanban.db. Two dispatchers that each believe they own the file both pass SQLite busy_timeout and then race on WAL frames — the documented root cause of multi-writer corruption (issue #35240). The existing _guard_supervised_gateway_conflict startup guard blocks the common way an orphan is born, but does nothing once a second dispatcher already exists. This adds the defense-in-depth: dispatch_once now wraps every tick in a non-blocking, board-scoped flock (_dispatch_tick_lock). A losing dispatcher returns DispatchResult(skipped_locked=True) and does zero DB writes this tick — so two dispatchers can never run a reclaim/spawn/write sequence concurrently regardless of how the second one got there. - Non-blocking (LOCK_NB): never stalls the gateway's async watcher. - Board-scoped: lock file is a .dispatch.lock sibling of each board's kanban.db, so unrelated boards tick in parallel. - POSIX + Windows (fcntl / msvcrt LK_NBLCK), no-op degrade where neither exists — mirrors the existing _cross_process_init_lock pattern. Verified with a real two-process orphan repro: while a separate process holds the lock, dispatch_once skips; after release it runs.	2026-06-21 12:06:24 -07:00
teknium1	15cfc2836f	fix(kanban): anchor no-path worktree tasks on board default_workdir Follow-up to the salvaged worktree-materialization fix. When a worktree task has no explicit workspace_path, resolve the anchor from the board's default_workdir (a git repo) and materialize <repo>/.worktrees/<id> per task, instead of silently rooting under the dispatcher's CWD (whatever directory launched the gateway, e.g. the Hermes checkout). If no default_workdir is configured, raise with a clear message rather than guessing from CWD. Adds AUTHOR_MAP entry for the salvaged commit.	2026-06-20 19:12:23 -07:00
Ahmad Ashfaq	d79f67fda6	fix(kanban): materialize and reuse linked worktrees for worktree tasks The dispatcher treated workspace_kind=worktree as metadata only and never ran 'git worktree add', so every worktree task ran in the main repo checkout instead of an isolated worktree — concurrent tasks silently shared one tree and contaminated each other. This materializes a real linked worktree at <repo>/.worktrees/<task_id> on branch wt/<task_id> when resolve_workspace() handles a worktree task, treats a repo-root workspace_path as shorthand for that location, persists the derived workspace/branch back onto the task row, and — on rerun/redispatch — detects an already-materialized linked worktree (via git-common-dir) and reuses it instead of nesting a second .worktrees/<id> inside it.	2026-06-20 19:12:23 -07:00
Teknium	35e7ca03d5	fix(kanban): treat already-gone worker as terminated, not survived _terminate_reclaimed_worker early-returned on ProcessLookupError with terminated=False. The new reclaim-defer guard reads that as 'worker survived the kill' and defers the reclaim forever, so a stale task whose worker is already dead never lands in result.stale. ProcessLookupError means the process is gone — that IS a successful termination. Split it from the generic OSError branch and set terminated=True.	2026-06-19 07:38:10 -07:00
Sahil Saghir	b9e521da23	fix(kanban): hold reclaim while the worker is still alive release_stale_claims and detect_stale_running call _terminate_reclaimed_worker and then release the task claim unconditionally, even when the termination did not actually kill the worker. _terminate_reclaimed_worker already reports this via its "terminated" flag, but the callers ignore it. When a worker is parked in uninterruptible (D) state — for example throttled by a cgroup memory.high limit — a pending SIGTERM/SIGKILL cannot be delivered until the throttle lifts, so the kill is a no-op. The dispatcher then frees the claim and spawns a fresh worker beside the still-alive one. Repeated every dispatch tick this accumulates duplicate workers without bound, deepening the memory pressure that caused the throttle in the first place — a self-reinforcing runaway. Fix: gate both automatic reclaim paths on _worker_survived_termination(). When we attempted to kill our own host-local worker and it is still alive, defer the reclaim (_defer_reclaim_for_live_worker extends the claim a short grace and emits a reclaim_deferred event) instead of releasing. This guarantees at most one live worker per task and is self-correcting: not spawning a duplicate is what relieves the pressure so the pending signal lands and the worker dies, and the next tick reclaims cleanly. Non-host-local claims and the operator-driven reclaim_task() path keep their existing force-release behaviour. Related: #41448 (concurrent dispatchers amplify this by doubling reclaim frequency); #42858 (kill the worker rather than orphan it on archive). Tests: defer-when-worker-survives, reclaim-when-killed, release-when-not-host-local, and the detect_stale_running path.	2026-06-19 07:38:10 -07:00
Teknium	cb125c2b3f	fix(kanban): pin assigned profile toolsets for workers (#45590)	2026-06-13 05:50:09 -07:00
teknium1	76f01780f0	fix(kanban): sweep deferred scratch parent on non-scratch child completion + tests Follow-up on the deferred-cleanup salvage (#33774): _cleanup_workspace returned early for a non-scratch ('dir'/'worktree') task and never ran the parent sweep, so a scratch parent waiting on a 'dir' child would leak its deferred workspace forever. Run the parent sweep before the early return. Adds regression tests: deferred-while-child-active, swept-after-last-child, and dir-child-unblocks-scratch-parent.	2026-06-07 09:50:44 -07:00
annguyenNous	9405cd0812	fix: defer scratch workspace cleanup when task has active children (#33774) When a Kanban task with workspace_kind=scratch completes, the _cleanup_workspace() function immediately deletes the workspace directory. If the task has children linked via task_links, those children find the workspace deleted when they start. This fix adds two checks: 1. Before deleting, check if any children are still active (todo/ready/running). If so, defer cleanup. 2. After a child completes, check if parent workspace can now be cleaned up (all children terminal). Fixes NousResearch/hermes-agent#33774	2026-06-07 09:50:44 -07:00
worlldz	081694c111	fix(kanban): isolate board override per concurrent call	2026-06-04 07:39:53 -07:00
Teknium	4c544b633d	fix(kanban): don't permanently block tasks that hit a provider rate limit (#38223 ) A kanban worker that exhausted its retries purely on a provider rate limit / quota wall (e.g. opencode-go's 5-hour window) exited with code 1. The dispatcher counted that as a crash, and with DEFAULT_FAILURE_LIMIT=2 two quota-wall hits permanently blocked the card. Fanning out many workers against one shared quota made this routine. Now a rate-limited worker exits with EX_TEMPFAIL (75); the dispatcher classifies that as a 'rate_limited' exit, releases the task back to 'ready' WITHOUT incrementing consecutive_failures (the breaker can't trip on a transient throttle), and the respawn guard defers the next attempt on a cooldown (default 5min, HERMES_KANBAN_RATE_LIMIT_COOLDOWN_SECONDS) until the quota window clears. Genuine crashes still count and trip the breaker as before. The 120s Retry-After cap is unchanged — no worker parks for hours holding a slot. - conversation_loop.py: surface failure_reason in the exhaustion return - cli.py: kanban worker picks exit 75 on rate_limit/billing failure - kanban_db.py: rate_limited exit kind, no-count requeue, cooldown guard	2026-06-03 06:19:32 -07:00
Teknium	72e82f88c0	fix(kanban): decompose children inherit root workspace instead of forcing scratch (#37172) decompose_triage_task hardcoded every fan-out child to workspace_kind 'scratch', ignoring the root task's workspace. A code-gen task created with a dir:/worktree: workspace would fan out into throwaway scratch tmp dirs (GC'd on archive), so generated code never landed in the project. Children now inherit the root's workspace_kind + workspace_path. A child dict may still override with its own workspace_kind/workspace_path; the path only carries over when kinds match. Scratch roots are unchanged.	2026-06-01 20:26:57 -07:00
Teknium	0cd7d54b00	feat(kanban): goal_mode cards run workers in a /goal loop (#35710 ) * feat(kanban): goal_mode cards run workers in a /goal loop A goal_mode card wraps its dispatched worker in the Ralph-style goal loop behind /goal: after each turn an auxiliary judge checks the worker's response against the card title+body, and if not done the worker keeps going in the SAME session until the judge agrees, the worker terminates the task itself, or the turn budget runs out (which blocks the card for human review — never a silent exit). - kanban_db: goal_mode + goal_max_turns columns (additive migration), Task fields, create_task params, INSERT wiring, created-event payload. - kanban_tools: goal_mode/goal_max_turns on the kanban_create tool so orchestrators can opt cards in when fanning out. - kanban CLI: --goal / --goal-max-turns on 'kanban create'. - dashboard API: goal_mode/goal_max_turns on the create endpoint (auto-surfaced back via asdict). - _default_spawn: sets HERMES_KANBAN_GOAL_MODE / _GOAL_MAX_TURNS only when the card opts in. - goals.run_kanban_goal_loop: standalone, callback-injected loop engine (no SessionDB persistence; ephemeral worker). cli.py quiet path calls it after the worker's first turn when the env vars are set. - Docs: orchestrator skill + kanban feature page. Tests: DB roundtrip + legacy migration, spawn env gating, and the loop's continuation/completion/budget-block/finalize-nudge branches. E2E run against a real kanban DB confirms a budget-exhausted goal worker lands in a sticky blocked state. * feat(kanban/dashboard): goal-mode toggle in the create form Wires the goal_mode card setting into the dashboard UI (the plugin's hand-written IIFE bundle, no build step): - InlineCreate: 'goal mode' checkbox after the skills field; checking it reveals an optional 'max turns' number input. Both reset on submit and only post goal_mode/goal_max_turns when enabled. 
- TaskDrawer: a 'Goal mode: on (max N turns)' MetaRow so a card's goal-mode setting is visible after creation (auto-fed by asdict via the existing _task_dict). Live-tested through the running dashboard with a browser: created a goal-mode card with max-turns=8, confirmed it persisted to the kanban DB (goal_mode=1, goal_max_turns=8) and rendered back in the drawer as 'on (max 8 turns)'. No JS console errors.	2026-05-31 01:16:33 -07:00
Teknium	b47cb1bbf2	feat(kanban): file attachments on tasks (#35395 ) Tasks can now carry file attachments (PDFs, images, source docs) that workers read directly — closes the gap where source material had to be pasted as a path into the task body. - kanban_db: task_attachments table (additive), Attachment dataclass, add/list/get/delete accessors, attachments_root/task_attachments_dir path helpers (per-board, HERMES_KANBAN_ATTACHMENTS_ROOT override) - build_worker_context: surfaces each attachment's absolute path so the worker (full file/terminal tool access) reads it via read_file/pdftotext - dashboard API: POST/GET/DELETE attachment routes (multipart upload, 25MB cap, traversal-safe filenames, root-containment check on download) - dashboard UI: Attachments section in the task drawer — upload button, list with download, per-row remove - docs + tests (13 cases: DB accessors, REST round-trip, traversal rejection, collision suffixing, worker-context surfacing) Closes #35338	2026-05-30 07:41:04 -07:00
teknium1	8e5a6854c3	fix(kanban): align recompute_ready guard with breaker's configured failure_limit Follow-up to the budget-exhaustion recovery fix. recompute_ready's new circuit-breaker guard resolved its effective limit from per-task max_retries -> DEFAULT_FAILURE_LIMIT, skipping the dispatcher's configured kanban.failure_limit. _record_task_failure resolves max_retries -> failure_limit(config) -> DEFAULT, so the two disagreed whenever an operator set kanban.failure_limit != 2: - config > 2: a task could get stuck at DEFAULT(2) before reaching its allowed retry count. - config < 2: a task the breaker already blocked could be auto-recovered back to ready, defeating the stricter limit. Thread the dispatcher's failure_limit through dispatch_once into recompute_ready so the guard and the breaker share one resolution order. Updated test_circuit_breaker_block_still_auto_promotes (it asserted a failures=5 block auto-recovers and resets the counter — that's the pre-#35072 behavior the loop fix removes); it now exercises a below-limit transient block, with the at-limit case covered in test_kanban_db.py. Added two tests for the config-tier and per-task override resolution.	2026-05-30 01:40:57 -07:00
liuhao1024	6ab71d3bb4	fix(kanban): prevent infinite retry loop when worker exhausts iteration budget recompute_ready() previously reset consecutive_failures to 0 when auto-recovering a blocked task. This defeated the circuit-breaker: a task that repeatedly exhausted its iteration budget would cycle forever (block → auto-recover with counter=0 → respawn → budget exhausted → block → …) with no signal to the operator. Fix: don't auto-recover tasks whose consecutive_failures has reached the effective failure limit (per-task max_retries or DEFAULT_FAILURE_LIMIT). The counter is also preserved across recovery so the breaker can accumulate across cycles. Fixes #35072	2026-05-30 01:40:57 -07:00
teknium1	c70dca3a88	fix(kanban): rebuild legacy TEXT-PK tables to INTEGER AUTOINCREMENT on open Legacy kanban boards (pre-AUTOINCREMENT schema) crashed the gateway notifier on every tick — int(None) on a NULL id in unseen_events_for_sub — silently losing all kanban notifications. CREATE TABLE IF NOT EXISTS skips existing tables regardless of schema and _add_column_if_missing only adds columns, so neither could fix a drifted primary-key type. _rebuild_drifted_tables() detects the legacy shape via PRAGMA table_info and rebuilds task_events/task_comments/task_runs (TEXT PK -> INTEGER AUTOINCREMENT) and kanban_notify_subs.last_event_id (TEXT/NULL -> INTEGER NOT NULL DEFAULT 0), preserving data. The whole pass is one transaction so an interruption can't leave a table half-renamed, and recreates every index DROP TABLE would otherwise take down (including idx_events_run). Co-authored-by: liuhao1024 <liuhao1024@users.noreply.github.com>	2026-05-30 01:40:49 -07:00
teknium1	ddaf2f6712	style: restore PEP8 blank-line separation after dead-code removal The deletions in the salvaged commit left some top-level defs/classes separated by a single blank line. Restore the 2-blank-line separation.	2026-05-29 04:22:27 -07:00
kshitijk4poor	dc235e93cb	chore: remove dead code — 28 unused functions/classes across 16 files Vulture + per-symbol verification (whole-repo grep incl. tests, string literals, getattr, decorator/registry/argparse dispatch) confirmed each of these has zero callers anywhere — not reachable via any dynamic-dispatch path, not referenced by tests, not re-exported. Removed: - acp_adapter/tools.py: _build_patch_mode_content - agent/anthropic_adapter.py: read_claude_managed_key (diagnostics-only, never called) - agent/bedrock_adapter.py: get_bedrock_model_ids - agent/browser_registry.py: get_active_browser_provider - agent/chat_completion_helpers.py: _take_request_client (x2 nested closures, never invoked) - gateway/platforms/weixin.py: _rewrite_headers_for_weixin, _rewrite_table_block_for_weixin - hermes_cli/banner.py: _skin_branding - hermes_cli/debug.py: _delete_hint - hermes_cli/gateway.py: _setup_email, _setup_sms, _setup_yuanbao (platform keys absent from the _builtin_setup_fn dispatch dict; handled by the _setup_standard_platform fallback) - hermes_cli/kanban_db.py: set_max_runtime, active_run - hermes_cli/kanban_diagnostics.py: severity_of_highest, _latest_clean_event_ts - hermes_cli/main.py: _build_provider_choices, cmd_portal (portal subcommand is wired via portal_cli.add_parser, not this wrapper) - hermes_cli/model_switch.py: CustomAutoResult (orphaned by the switch_model() extraction) - hermes_cli/models.py: format_model_pricing_table, fetch_nous_account_tier - hermes_cli/portal_cli.py: _nous_portal_base_url - hermes_cli/proxy/server.py: handle_models_fallback (defined but never registered on the router) - tools/computer_use/cua_backend.py: _parse_element, _is_arm_mac - tools/file_operations.py: _get_safe_write_root (prod uses the imported agent.file_safety.get_safe_write_root directly) - tools/skills_tool.py: _load_category_description Also dropped two imports left unused by the removals: - tools/file_operations.py: get_safe_write_root alias - 
tools/computer_use/cua_backend.py: import platform Pure deletion: -551 LOC. No behavior change. Test files covering the edited modules pass (640/640); the broader suite's pre-existing/env-dependent failures reproduce unchanged on origin/main.	2026-05-29 04:22:27 -07:00
teknium1	592a4ffb6b	fix(kanban): close three blocked/iteration-exhausted handling gaps (#29747 ) Reporter diagnosed three independent gaps that together allowed infinite 'unblock → re-stuck' loops with no surfacing or escalation: GAP 1: `_rule_stuck_in_blocked` resets timer on any `commented`/`unblocked` event, so a task that cycles every few minutes is invisible to it regardless of how many times it cycles. Fix: new `_rule_block_unblock_cycling` rule (`hermes_cli/kanban_diagnostics.py`) that counts block→unblock cycles in a sliding window. Default threshold 3 cycles within 24h, configurable via `block_cycle_threshold` / `block_cycle_window_seconds`. Walks events in arrival order (event id) since multiple events can share the same `created_at` second. Fires as a warning with a CLI hint to inspect the block reasons. GAP 2: Iteration-budget-exhausted runs in kanban workers map to `kanban_block` (status=blocked, but a clean exit from the kernel's perspective). `_rule_repeated_failures` reads `consecutive_failures`, which `_record_task_failure` increments only for crashed/timed_out/ spawn_failed — `blocked` outcome bypasses the failure counter, so the `kanban.failure_limit` circuit breaker never trips on budget-exhaustion loops. Fix: `agent/conversation_loop.py` budget-exhaustion path now calls `_record_task_failure(outcome="timed_out")` instead of `kanban_block`. Budget exhaustion is genuinely a timeout-shaped failure (the task ran out of allowed iterations), so this is more honest semantics; it also routes through the unified failure counter, so repeated budget exhaustions trip the circuit breaker and the task auto-blocks with `gave_up` after `failure_limit` retries. GAP 3: `release_stale_claims` uses `_pid_alive(worker_pid)` only and ignores `last_heartbeat_at`. 
Reporter observed a 91-min run that held its claim with frozen heartbeat because the worker entered a logic loop with no tool calls — `_pid_alive` kept returning True so the claim was extended every 15 minutes indefinitely. Fix: heartbeat-stale backstop. If `last_heartbeat_at` is set AND older than `DEFAULT_CLAIM_HEARTBEAT_MAX_STALE_SECONDS` (default 1h), reclaim even if the PID is alive. NULL `last_heartbeat_at` preserves backward compatibility (no heartbeat yet = extend, as before). The reclaim event payload now includes a `heartbeat_stale` boolean so operators see why a live-PID worker was reclaimed. This works cleanly in concert with PR #34418 (#31752 runtime → heartbeat bridge): once `_touch_activity` keeps `last_heartbeat_at` fresh as a side effect of normal API traffic, the backstop only fires for genuinely wedged workers (no chunks, no tool results, no progress at all). Co-authored-by: baofuen <45189813+baofuen@users.noreply.github.com>	2026-05-29 00:13:29 -07:00
kshitijk4poor	66827f8947	chore: prune unused imports and duplicate import redefinitions Remove unused imports (F401) and duplicate/shadowed import redefinitions (F811) across the codebase using ruff's safe autofixes. No behavioral changes -- imports only. - ~1400 safe autofixes applied across 644 files (net -1072 lines) - __init__.py re-exports preserved (excluded from F401 removal so public re-export surfaces stay intact) - Re-exports that are imported or monkeypatched by tests but look unused in their defining module are kept with explicit # noqa: F401 (gateway/run.py load_dotenv; run_agent re-exports from agent.message_sanitization, agent.context_compressor, agent.retry_utils, agent.prompt_builder, agent.process_bootstrap, agent.codex_responses_adapter) - Unsafe F841 (unused-variable) fixes deliberately skipped -- those can change behavior when the RHS has side effects - ruff lints remain disabled in pyproject.toml (only PLW1514 is selected); this is a one-time cleanup, not a config change Verification: - python -m compileall: clean - pytest --collect-only: all 27161 tests collect (zero import errors) - core entry points import clean (run_agent, model_tools, cli, toolsets, hermes_state, batch_runner, gateway) - static scan: every name any test imports directly from an edited module still resolves	2026-05-28 22:26:25 -07:00
Teknium	3b6347af15	feat(kanban): default_assignee fallback + per-profile concurrency cap (#27145 , #21582 ) (#34244 ) Two related dispatcher behaviors that have been missing for a while. ## kanban.default_assignee (#27145) Reporter (@agarzon): dashboard creates a task without an assignee, task parks in 'ready' forever even though the operator's intent ('default') is perfectly clear. The dispatcher already had a 'skipped_unassigned' bucket but no fallback routing — users had to manually type 'default' in the assignee field every time. Behavior: when 'kanban.default_assignee' is set in config.yaml, the dispatcher applies that assignee to any unassigned ready task before deciding whether to spawn. The row is mutated (assignee column + an 'assigned' event with source='kanban.default_assignee' for the audit trail). Empty/whitespace config value = no fallback, preserving the existing skipped_unassigned behavior. Dry-run mode reports what WOULD happen via the new 'auto_assigned_default' bucket on DispatchResult, but does NOT mutate the DB — operators using 'hermes kanban dispatch --dry-run' see the routing decision before committing. ## kanban.max_in_progress_per_profile (#21582) Reporter (@edwardchenchen, @simlu, 4 reactions): fan-out workloads saturate one profile's local model / API quota / browser pool while other profiles sit idle. The existing global 'max_in_progress' caps total workers but doesn't balance across profiles. Behavior: when 'kanban.max_in_progress_per_profile' is set to a positive int, the dispatcher tracks per-assignee running counts (one query at tick start) and refuses to spawn for any assignee already at the cap. Tasks blocked this way go to a new 'skipped_per_profile_capped' bucket on DispatchResult as (task_id, assignee, current_running_count) tuples — NOT an operator-actionable failure, just 'try again next tick when the profile has capacity'. Pre-existing 'running' tasks count against the cap (verified via regression test). 
The cap respects dry_run mode by incrementing its in-memory counter on each would-be spawn so dry_run reports the same balanced subset that a real tick would. Invalid cap values (0, negative, non-int, None) are treated as 'no cap', preserving the existing behavior. Backward-compatible for installs that don't set the config. ## Surfaces - 'hermes kanban dispatch' CLI now prints 'Auto-assigned to kanban.default_assignee=X: ...' and 'Deferred (X at per-profile cap, N running): ...' lines, plus matching JSON keys in --json output. - Gateway dispatcher logs the configured values at startup ('default_assignee=X', 'max_in_progress_per_profile=N'). - 'kanban.max_in_progress_per_profile' added to DEFAULT_CONFIG with inline docs. ## Validation - tests/hermes_cli/test_kanban_default_assignee.py (6 cases): no-cap baseline, auto-assign + DB mutation, dry-run reports without mutating, whitespace treated as None, explicit assignees untouched, DispatchResult field schema. - tests/hermes_cli/test_kanban_per_profile_cap.py (9 cases including 4 parametrized): no-cap baseline, balanced 2-profile fan-out, pre-existing running counts against cap, invalid cap values (0/-1/'abc'/None), capped tasks dispatched on next tick after running task completes, DispatchResult field schema. - Broader kanban suite: 464/464 pass (was 449 baseline; +15 new regression tests across both features). ## Credit #27145 — Jimmy Johansson reported the dispatcher skipped-unassigned gap; @agarzon scoped the simpler 'honor kanban.default_assignee' fix that matches the existing config knob. #21582 — @edwardchenchen filed the per-profile cap ask after hitting model 429s on fan-out research projects; @simlu confirmed the same pain on local-model setups.	2026-05-28 19:02:55 -07:00
teknium1	6f9182cb34	fix(kanban): content-addressed corrupt-DB backup filename Repeated quarantines of an unchanged corrupt kanban.db used to amplify disk usage by N: the gateway dispatcher's 5-minute retry loop, multi- profile fleets sharing one DB, and manual reopen attempts each produced a fresh '.corrupt.<timestamp>.bak' copy of the same bytes. After 10 retries on a 100KB DB you had 11x the disk footprint of duplicate corrupt data. Derive the backup filename from a sha256 of the main DB instead of a timestamp + collision counter. Same bytes → same filename → skip the copy on retries. Different bytes (partial repair, further damage) → different filename → preserve separately. Sidecar (-wal/-shm) backups inherit the same content-addressed name. Inspired by @hanzckernel's PR #33529, simplified down to ~30 LOC: drop the persistent JSON marker file, drop the atomic temp+fsync+rename helper (shutil.copy2 is fine for a quarantine-only path), drop the gateway-side WAL/SHM fingerprint extension (the existing (path, mtime, size) tuple still gives the 5-minute retry semantics it needs), and drop the gateway-side helper extraction. The backup file existing IS the marker; no separate state needed. Test: tests/hermes_cli/test_kanban_db.py::test_repeated_corrupt_open_reuses_single_backup proves 10 retries on the same corrupt bytes produce 1 backup (was 11), and mutating the corrupt bytes produces a second backup with a different fingerprint. Refs #33529 Co-authored-by: hanzckernel <zhicheng.han@mathematik.uni-goettingen.de>	2026-05-28 03:38:09 -07:00
Robin Fernandes	dc52b82d53	test(auth): update entitlement CI expectations	2026-05-28 00:19:31 -07:00
Squiddy	3ba8962738	fix(kanban): add Windows init lock guard	2026-05-27 23:28:51 -07:00
Squiddy	90b6b3d18f	fix(kanban): harden sqlite connection concurrency	2026-05-27 23:28:51 -07:00
teknium1	ebe04c66cd	fix(kanban): close kanban.db FD after every connect() in long-lived processes `sqlite3.Connection.__exit__` commits/rollbacks but does NOT close the underlying FD. `with kb.connect() as conn:` in long-lived processes (gateway `run_slash`, dashboard `decompose_task_endpoint`) therefore leaks one FD to `kanban.db` per call. After enough operations the gateway dies with `[Errno 24] Too many open files` (~4 days uptime in the production report — #33159). Fix: add a `connect_closing()` context manager in `hermes_cli/kanban_db` that wraps `connect()` with a real `try/finally: conn.close()`. Switch the 42 leak-prone call sites in `hermes_cli/kanban.py` (35), `hermes_cli/kanban_decompose.py` (4), and `hermes_cli/kanban_specify.py` (3) over to it. `kanban.py` matters because `run_slash` (called from the gateway for every `/kanban` slash command) parses argparse and dispatches to those `_cmd_*` functions in-process — each one was leaking one FD per invocation. Tests inside `tests/` are untouched: short-lived processes where OS cleanup masks the leak. Regression tests added in `test_kanban_db.py` cover both happy-path and exception-path closure, plus an explicit assertion that bare `with kb.connect()` still does NOT close (documenting the upstream sqlite3 behaviour we're working around). Closes #33159.	2026-05-27 22:07:49 -07:00
Stephen Chin	ffdc937c18	fix(kanban): hoist zombie reaper out of dispatch_once Reaper now runs at the top of every dispatcher tick regardless of per-board connect() failures. Previously the reaper sat inside dispatch_once after the kanban_db.connect() call — any EIO during connect would skip reaping for that tick, accumulating zombie workers and stale claim_lock rows. Also: reap_worker_zombies now returns the list of reaped pids (the dispatcher logs them) and a test indentation fix. Squashes three sibling commits from PR #32301 into one logical change for batch review.	2026-05-27 14:31:55 -07:00
steveonjava	99c19eb2fe	fix(kanban): add post-commit page_count invariant check to write_txn Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect(). Refs: #31208 (Bug E - same file, coordinate), #30973 (wal_autocheckpoint) Refs: #30445, #30896, #30908 (corruption reports)	2026-05-27 14:31:55 -07:00
Stephen Chin	c002668ff0	fix(kanban): add grace period to detect_crashed_workers `detect_crashed_workers` calls `_pid_alive` on every `running` task whose claim is held by this host. The check can transiently return False for a freshly-spawned worker (fork → /proc-visibility lag, or reap-race between SIGCHLD and parent reaping). When a second dispatcher ticks inside that window it reclaims the task and spawns a duplicate worker. Add `DEFAULT_CRASH_GRACE_SECONDS = 30` and an `HERMES_KANBAN_CRASH_GRACE_SECONDS` env-var override. `detect_crashed_workers` skips the liveness check when `time.time() - started_at < grace`. The existing 15-minute claim TTL still reclaims genuinely-crashed workers; grace only suppresses the launch-window false positive. `HERMES_KANBAN_CRASH_GRACE_SECONDS=0` is set on the `kanban_home` fixture in `test_kanban_core_functionality.py` so existing tests that assert immediate reclaim retain pre-fix semantics. Companion to merged PR #23442 (`release_stale_claims`, closes #23025), which addressed the same multi-dispatcher race in the stale-claim path. Related: #20015 (`_pid_alive` false-negative behaviour),	2026-05-27 14:31:55 -07:00
Stephen Chin	e83252dc46	fix(kanban): preserve original exception when write_txn rollback fails When code inside a write_txn block raises an OperationalError that SQLite has already auto-rolled-back (typical for 'disk I/O error', 'database is locked', and 'database disk image is malformed'), the explicit ROLLBACK in write_txn.__exit__ itself raises 'cannot rollback - no transaction is active', and that secondary exception replaces the original in the traceback. Operators see a misleading error and lose the diagnostic information they need. Swallow the rollback-time OperationalError so the caller always sees the original cause. Confirmed reproducer: tests/hermes_cli/test_kanban_db.py::test_write_txn_preserves_original_exception_when_rollback_fails	2026-05-27 14:31:55 -07:00
Stephen Chin	6416dd5187	fix(kanban): harden SQLite against torn-write corruption (secure_delete + cell_size_check + synchronous=FULL) Production corruption #6 left b-tree pages with zeroed headers but intact old cell content — the Bug E pattern. This fix applies three pragma calls on every connect(): - synchronous=FULL (was NORMAL): closes the WAL-checkpoint reordering window where a crash between WAL commit and main-DB write leaves a partially-written b-tree page header. Cost is <1ms per commit on local SSD; negligible at kanban write volume. - secure_delete=ON: forces SQLite to zero freed page bytes on disk. If a torn write or hardware fault later corrupts a page, the underlying cell content is zero, so corruption is detectable and no stale rows can resurface as live data. - cell_size_check=ON: adds a read-side guard so corrupt cells surface as errors at read time rather than as silent wrong-data returns. All three are connection-scoped and re-applied on every connect(). secure_delete also writes a persistent flag into the DB header on the first call against a fresh DB, making the protection durable across processes for new DBs. Tests added for all four required cases: each pragma active on a fresh connection, and all three re-applied after close+reopen. Also adds the required negative test (migration path does not reset pragmas).	2026-05-27 14:31:55 -07:00
leeseoki0	ce529d6072	fix(kanban): scratch tasks must not inherit board.default_workdir (#28818) Board defaults represent persistent project checkouts. Scratch workspaces are auto-deleted on completion and must stay under the per-board scratch root that resolve_workspace() creates. Inheriting default_workdir for a scratch task pointed the cleanup path at the user's source tree — the data-loss vector documented in #28818. The containment guard in _cleanup_workspace (just added) is the safety rail. This commit prevents the bad state from being created in the first place: only persistent kinds (dir/worktree) inherit board defaults. Tests updated to cover the new semantics: scratch with default_workdir set keeps workspace_path=None; dir/worktree still inherits the board default. Salvaged from PR #31315 by @leeseoki0 — prevention layer on top of the #28819 containment fix by @briandevans. Co-authored-by: teknium1 <127238744+teknium1@users.noreply.github.com>	2026-05-24 15:48:58 -07:00
briandevans	23115b5c0f	fix(kanban): restrict managed-scratch roots to workspaces/ dirs only Copilot review on PR #28819 flagged that `_is_managed_scratch_path` accepted the entire `<kanban_home>/kanban` subtree as managed scratch storage. With that, a task whose `workspace_kind='scratch'` and `workspace_path` was mis-set to `<kanban_home>/kanban`, `.../kanban/logs`, or a board's metadata directory (e.g. `.../kanban/boards/<slug>` without the `workspaces/` child) would pass the containment guard and let task completion `shutil.rmtree` Hermes' own DB, metadata, and log subtrees. Tighten the guard: * Allowed roots are now exclusively `workspaces/` directories — the `HERMES_KANBAN_WORKSPACES_ROOT` override, `<kanban_home>/kanban/workspaces`, and each `<kanban_home>/kanban/boards/<slug>/workspaces` discovered on disk. * Require strict descendancy: a path equal to a root itself is rejected too, because deleting a workspaces root would wipe every task's scratch dir at once. Add a regression test covering the three Copilot-named attack paths (kanban root, kanban/logs, board root without `workspaces/`) plus the workspaces-root-itself case, and confirm the inner task-id dir still matches.	2026-05-24 15:48:58 -07:00
briandevans	80ad1609c8	fix(kanban): refuse to rmtree workspace_path outside managed scratch root (#28818 ) A board's ``default_workdir`` (e.g. ``hermes kanban boards set-default-workdir my-board /path/to/real/source``) is copied into ``tasks.workspace_path`` for tasks created without an explicit ``workspace_kind``. Those tasks default to ``workspace_kind='scratch'``, so completion calls ``_cleanup_workspace`` and unconditionally runs ``shutil.rmtree(wp, ignore_errors=True)`` — deleting the user's real source tree as if it were disposable scratch storage. Add ``_is_managed_scratch_path()`` and gate ``_cleanup_workspace`` on it: only delete paths under ``HERMES_KANBAN_WORKSPACES_ROOT`` (the worker-side override the dispatcher injects) or under the active kanban home's ``kanban/`` subtree (covering both the legacy default-board root and per-board ``kanban/boards/<slug>/workspaces`` roots). Anything else gets a warning log and is left alone, so a misconfigured ``default_workdir`` can no longer destroy user data on task completion.	2026-05-24 15:48:58 -07:00
David Murray	d46adad22f	feat(cli): kanban promote verb for manual todo->ready recovery Adds `hermes kanban promote <task_id>` for manual lifecycle recovery when an auto-promote daemon misses the parent-done transition (issue #28822). Refuses promotion unless every parent dep is done/archived (override with --force). Emits a `promoted_manual` audit event distinct from the automatic `promoted` kind, so audit consumers can filter human-driven from system-driven promotions. Supports --dry-run and --json for orchestration. Does not mutate assignee/claim state — the dispatcher picks the card up via its normal ready polling path. Closes #28822.	2026-05-23 23:10:36 -07:00
Teknium	ad11327db0	feat(kanban): warn users that scratch workspaces are deleted on completion (#30949) First scratch workspace creation on an install now emits a one-shot warning log + a 'tip_scratch_workspace' event on the task. Sentinel file at ~/.hermes/kanban/.scratch_tip_shown silences subsequent creations across the whole install. Behavior unchanged — scratch is still ephemeral by design. This just makes the design visible to new users (reported in user community: 'progress files vanished, no warning anywhere'). Docs (en + ko) updated to spell out 'Deleted when the task completes' on the scratch bullet and 'Preserved on completion' on worktree/dir.	2026-05-23 11:27:00 -07:00
teknium1	c4b8f5efee	fix(kanban): harden corrupt-db backup against CodeQL path-injection findings Path.resolve() before any I/O and confine backup writes to the resolved parent directory. Adds explicit parent-equality assertions so static analyzers see the containment guarantee, and walks WAL/SHM sidecars through the same resolved-parent path so accidental .. segments are collapsed before shutil.copy2. Functionally equivalent to the original PR; preserves the corrupt bytes to <db>.corrupt.<ts>.bak in the same directory, still raises KanbanDbCorruptError from connect(). E2E with Stefan's exact hex header + malformed pages still passes. 163/163 kanban tests still pass.	2026-05-23 05:51:33 -07:00
Nick	39fe4ecee3	fix(kanban): refuse corrupt db auto-init	2026-05-23 05:51:33 -07:00
helix4u	1a7bb988fc	fix(gateway): harden kanban and provider cleanup races	2026-05-20 14:31:22 -07:00
xxxigm	34120a0ae2	fix(kanban): worker-initiated block must not be auto-promoted (#28712 ) When a worker calls ``kanban_block(reason="review-required: ...")`` to hand a task off for human review, the dispatcher's ``recompute_ready`` was treating the resulting ``blocked`` status as eligible for auto-promotion — exactly the same as a circuit-breaker block. On the next tick the task flipped back to ``ready``, a fresh worker spawned, found nothing to do (work already applied, review-required comment already posted), exited cleanly, got recorded as ``protocol_violation`` → ``gave_up`` → ``blocked``, and the dispatcher promoted again. Infinite loop until manual ``hermes kanban reclaim`` + ``kanban block``. Add ``_has_sticky_block`` which distinguishes the two block sources using the cheapest available signal: the most recent ``"blocked"``/``"unblocked"`` event in ``task_events``. * Worker / operator ``kanban_block`` emits ``"blocked"`` → ``_has_sticky_block`` returns True → ``recompute_ready`` skips the task entirely. ``unblock_task`` emits ``"unblocked"`` which flips the predicate back, so the only legitimate exit is the documented human-in-the-loop path. * Circuit-breaker ``_record_task_failure`` emits ``"gave_up"`` (not ``"blocked"``) → predicate stays False → original parent-completion-recovery semantics from #`40c1decb3` are preserved. * Tasks blocked purely by direct DB manipulation also recover, since they have no ``"blocked"`` event row at all — matches the existing ``test_recompute_ready_promotes_blocked_with_done_parents`` fixture behaviour.	2026-05-19 17:26:23 -07:00
kshitijk4poor	7552e0f3c0	fix(kanban): also hoist idx_events_run + drop redundant inner create Extends the previous commit to cover the remaining additive-column index that sits on the same migration trap: - ``task_events.run_id`` -> ``idx_events_run`` was still in SCHEMA_SQL. A legacy ``task_events`` table predating #17805 (no ``run_id``) would still abort ``executescript`` before ``_migrate_add_optional_columns`` could add the column. Hoisted out of SCHEMA_SQL and made unconditional in the migration alongside the other three indexes. - Removed the now-redundant ``CREATE INDEX idx_tasks_idempotency`` that was nested inside the ``if "idempotency_key" not in cols`` branch. The unconditional create lower in the function makes it idempotent on both fresh and legacy DBs. - Strengthened the regression test to cover all four indexes (``idx_tasks_session_id``, ``idx_tasks_tenant``, ``idx_tasks_idempotency``, ``idx_events_run``) and to seed a pre-#17805 ``task_events`` shape that exercises the ``run_id`` migration path. The result: every ``CREATE INDEX`` that depends on an additive column now runs after the migration ensures the column exists. Verified against a realistic pre-#16081 board fixture (tasks + task_events both legacy shape) — origin/main reproduces ``no such column: session_id``; this branch migrates cleanly and creates all four indexes.	2026-05-19 08:09:11 -07:00
Michael Nguyen	7c622b6c74	fix(kanban): migrate task session index after columns	2026-05-19 08:09:11 -07:00
Teknium	7bcdced6c1	fix(kanban): respawn guard defers blocker_auth instead of auto-blocking (#28683) Follow-up to #28455. The respawn guard's blocker_auth rule (last error matched a quota/auth/429 pattern) was auto-blocking the task on first occurrence. That's too aggressive: transient rate limits typically clear in seconds to minutes, but the auto-block puts the task in 'blocked' status which requires manual unblock. Now treats blocker_auth the same as recent_success and active_pr: defer the spawn this tick, leave the task in 'ready', let the next tick try again. If the auth error genuinely persists, the existing consecutive_failures counter trips the auto-block circuit breaker after failure_limit failures via the normal path — so a persistent 401/403/quota-exhausted still ends up blocked, just not on first hit. Also documents the respawn_guarded event in kanban.md's events table with the three guard reasons. Updated test_dispatch_respawn_guard_auto_blocks_auth_error → renamed to test_dispatch_respawn_guard_defers_auth_error_without_auto_block; asserts task stays in 'ready' and the guard reason is recorded.	2026-05-19 03:27:45 -07:00
Teknium	88ee58f7d2	fix(kanban): stale reclaim must not tick failure counter (#28680) Follow-up to #28452. detect_stale_running() was calling _record_task_failure() on every reclaim, which ticked the consecutive_failures counter. With the default failure_limit=2, two legitimately long-running tasks (>4 h without explicit heartbeat) would auto-block via the spawn-failure circuit breaker — even though no worker actually failed. Stale reclaim is dispatcher-side absence-of-heartbeat detection, not a worker fault. Removed the _record_task_failure() call; the 'stale' event in task_events is still the audit surface, but the failure counter is now reserved for spawn_failed / timed_out / crashed (real failures). Also documents the heartbeat requirement: - KANBAN_GUIDANCE in agent/prompt_builder.py now states the rule ('call kanban_heartbeat at least once an hour for tasks running longer than 1 hour') so workers learn the contract. - kanban.md adds the stale event row to the events table and flags the heartbeat requirement in the worker lifecycle list. New regression test: test_detect_stale_does_not_tick_failure_counter locks in the new behaviour.	2026-05-19 03:15:18 -07:00
Jpalmer95	dfcf48b476	feat(kanban): drag-to-delete trash zone + bulk delete for task cards Salvages #28125 by @Jpalmer95. Adds: - Drag-to-delete trash zone in the kanban dashboard - Bulk delete endpoint with cascading delete_task cleanup - Frontend updates (drag visual + drop handler) - Confirmation prompt before delete Resolved end-of-file test conflict by appending both halves.	2026-05-18 21:40:13 -07:00
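
Commit ebe04c66cd in the log above notes that `sqlite3.Connection.__exit__` commits or rolls back but does not close the connection, and works around it with a `connect_closing()` context manager. The following is a minimal sketch of that pattern, not the actual `hermes_cli/kanban_db` code: `connect` is a stand-in with an assumed signature, `is_closed` is a hypothetical helper for demonstration, and keeping sqlite3's commit-on-exit behaviour inside the wrapper is an assumption the commit message does not confirm.

```python
import sqlite3
from contextlib import contextmanager


def connect(path):
    # Stand-in for kanban_db.connect(); the real function does more setup.
    return sqlite3.connect(path)


@contextmanager
def connect_closing(path):
    """Yield a connection and always close it, even if the body raises."""
    conn = connect(path)
    try:
        # Assumption: keep sqlite3's commit-on-success / rollback-on-error.
        with conn:
            yield conn
    finally:
        # The step that a bare `with connect(...)` never performs.
        conn.close()


def is_closed(conn):
    # Hypothetical helper: a closed connection raises ProgrammingError on use.
    try:
        conn.execute("SELECT 1")
    except sqlite3.ProgrammingError:
        return True
    return False
```

With this shape, a call site changes from `with connect(path) as conn:` to `with connect_closing(path) as conn:` and no file descriptor outlives the block.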

141 commits
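
Commit 6f9182cb34 in the log above derives the corrupt-DB backup filename from a sha256 of the database so that retries on unchanged bytes reuse one backup. A rough sketch of that idea follows, under stated assumptions: the function name `quarantine_corrupt_db`, the 16-character digest prefix, and the exact `.corrupt.<digest>.bak` naming are illustrative guesses, and sidecar (-wal/-shm) handling is omitted.

```python
import hashlib
import shutil
from pathlib import Path


def quarantine_corrupt_db(db_path):
    """Copy a corrupt DB to a name derived from its bytes; skip if present."""
    db_path = Path(db_path).resolve()
    # Same bytes give the same digest, so the backup name is stable on retry.
    digest = hashlib.sha256(db_path.read_bytes()).hexdigest()[:16]
    # Hypothetical naming scheme; the commit does not spell out the format.
    backup = db_path.with_name(f"{db_path.name}.corrupt.{digest}.bak")
    if not backup.exists():
        # The existing backup file doubles as the "already quarantined" marker.
        shutil.copy2(db_path, backup)
    return backup
```

Calling this repeatedly on an unchanged file returns the same path and writes nothing new. A file whose bytes have changed gets a separate backup.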