hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-05-29 06:31:32 +00:00

Author	SHA1	Message	Date
LeonJS	9f008bcd5c	fix(kanban): release scratch workspace and tmux session on task completion Salvages #27369 by @LeonJS. complete_task() now calls _cleanup_workspace() and _cleanup_worker_tmux() after marking a task complete. Scratch workspaces (used by swarm agents) accumulate on disk — hundreds of MB per task, never released. Stale tmux sessions from completed agents also persist indefinitely. Both gates are safe: - workspace_kind == 'scratch' gate preserves user worktree/dir workspaces - tmux #{pane_dead} == 1 gate only kills sessions where the worker has already exited - best-effort: cleanup failures never block task completion	2026-05-18 20:45:29 -07:00
shunsuke-hikiyama	fb96208892	feat(kanban): add initial-status for human-ops cards Salvages #27526 by @shunsuke-hikiyama. Adds an --initial-status flag (running|blocked, default running) to 'kanban create', threaded through kanban_db.create_task() and the kanban_create tool schema. 'blocked' parks the task directly in the blocked column for R3 human-ops review, skipping the brief running-to-blocked transition. Dropped the unrelated 'add' alias, WIFEXITED Windows compat, and slash-handler error formatting changes that were bundled in the original PR — those should ship as their own focused changes if still wanted.	2026-05-18 20:44:02 -07:00
oemtalks	b9d38a56dd	fix(kanban): don't crash dispatched workers when kanban-worker skill is absent Salvages #27372 by @oemtalks. The dispatcher unconditionally injected `--skills kanban-worker` into every worker spawn, but worker profiles sometimes don't have that bundled skill in their skills dir, which is fatal at CLI startup (`ValueError: Unknown skill(s): kanban-worker`). Adds `_kanban_worker_skill_available(hermes_home)` and only injects the flag when the skill resolves. The MANDATORY lifecycle still ships via KANBAN_GUIDANCE in the system prompt, so omitting the flag is safe.	2026-05-18 20:32:20 -07:00
Ade5954	0392cf53b5	fix(kanban): close sqlite connection on init failure to prevent fd leak Salvages #28301 by @Ade5954. If WAL setup, PRAGMA application, or schema init raises after sqlite3.connect() succeeds, the new connection was leaking. Wrap the body in try/except so the connection is closed before the exception propagates.	2026-05-18 20:30:56 -07:00
DoGMaTiiC	4da4133d34	fix: assign single-task kanban decompositions	2026-05-18 20:26:02 -07:00
roycepersonalassistant	6c4f11c64a	fix: show scheduled kanban tasks in dashboard	2026-05-18 20:25:45 -07:00
hanzckernel	5d079fee17	fix: harden Kanban worker Hermes command resolution	2026-05-18 20:25:09 -07:00
ht1072	0b547aea03	fix(kanban): make legacy task migration idempotent (cherry picked from commit 293f1c3a7241b0117669e049d9aa746c9645ac90)	2026-05-18 20:24:53 -07:00
zccyman	fe5e0bf5a3	feat(kanban): add board-level default workdir (#25430)	2026-05-18 20:24:04 -07:00
LeonSGP43	8bfb456948	fix(kanban): pass accept-hooks to worker chat subprocess	2026-05-18 20:23:47 -07:00
LeonSGP43	0f620138b0	fix(kanban): make claim ttl configurable Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-05-18 20:23:31 -07:00
Interstellar-code	d8ad431de8	fix(kanban): task_age() tolerates ISO-8601 timestamps Prevents ValueError crash in dashboard get_board() when a task has an ISO timestamp (e.g. "2026-05-10T15:00:00Z") instead of a unix epoch int. Adds _to_epoch() helper that normalises both formats.	2026-05-18 20:18:04 -07:00
psionic73	ca8126bd53	fix(kanban): serialize DB initialization	2026-05-18 20:17:48 -07:00
soynchux	9281599b6f	fix(kanban): align board_exists with board discovery rules	2026-05-18 20:17:10 -07:00
bradhallett	de9bcfc6a0	fix(kanban): fingerprint crash errors to prevent fleet-wide retry exhaustion When a systemic failure (provider outage, auth expiry, OOM) crashes multiple workers simultaneously, detect_crashed_workers increments each task failure counter independently. The circuit breaker only trips after N × failure_limit retries across the fleet. Fingerprint crash errors by normalizing host-specific details (PIDs, timestamps). When 3+ tasks crash with the same fingerprint in a single detection cycle, immediately trip the circuit breaker (failure_limit=1) instead of waiting for repeated failures. Isolated crashes (unique fingerprints) retain their normal retry budget. Protocol violations continue to trip immediately. Includes regression tests for systemic and isolated crash paths.	2026-05-18 20:16:50 -07:00
bradhallett	f042931852	fix(kanban): reset failure counters on unblock_task When a task is manually unblocked (blocked → ready/todo), the consecutive_failures counter and last_failure_error were left intact. The next failure would immediately re-trip the circuit breaker because the counter was still at or above the failure limit. Reset both fields on unblock so the task gets a fresh retry budget. Includes a regression test that verifies counters are zeroed.	2026-05-18 20:16:32 -07:00
sprmn24	5db0d72c90	fix(kanban): use 'is not None' check for max_runtime_seconds in create_task max_runtime_seconds=0 was being silently coerced to None due to a falsy check (if max_runtime_seconds). Zero is a valid value that causes the dispatcher to immediately time out a task. The adjacent max_retries parameter already used the correct 'is not None' pattern. Fixes the inconsistency by aligning max_runtime_seconds with max_retries.	2026-05-18 20:16:15 -07:00
bradhallett	40c1decb3b	fix(kanban): promote blocked tasks when parent dependencies complete recompute_ready only scanned 'todo' tasks for promotion, ignoring 'blocked' tasks entirely. When a task was blocked (e.g. by the circuit breaker) and its parent dependencies later completed, the task stayed stuck in 'blocked' forever unless manually unblocked. Now recompute_ready also scans 'blocked' tasks. When all parents are done/archived, the blocked task is promoted to 'ready' with failure counters reset — equivalent to an automatic unblock. Includes a regression test for the blocked-parent-done promotion path.	2026-05-18 20:15:55 -07:00
Zyrixtrex	b7ea62e5d3	fix(kanban): promote dependents when a parent is archived	2026-05-18 20:15:03 -07:00
Zyrixtrex	326c15d955	fix(kanban): preserve notifier_profile for dashboard home subscriptions	2026-05-18 20:14:45 -07:00
QuenVix	8a64e1580b	fix(kanban): ignore stale HERMES_KANBAN_BOARD for removed boards	2026-05-18 20:14:10 -07:00
briandevans	d62964cdfa	fix(kanban): clear _INITIALIZED_PATHS in remove_board so recycled DBs re-init schema Archiving or deleting a board via remove_board() leaves the path's "schema already initialized" entry in the module-level cache. A concurrent connect(board=<slug>) call (e.g. the dashboard event-stream poll loop) then: 1. resolves the same kanban.db path, 2. recreates the directory + an empty sqlite file because connect() does mkdir(parents=True, exist_ok=True), 3. skips the CREATE TABLE pass because the cache entry says the schema is already in place, 4. errors on the next read with `no such table: task_events`. Drop the cache entry before mutating the filesystem so the fresh file gets a proper schema init on next connect(). Applies to both archive=True (rename) and archive=False (rmtree) branches. Fixes #23833. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 20:13:17 -07:00
hongchen1993	f01ee0b575	feat: per-task model override for kanban workers - Add model_override field to Task class and tasks schema - Add migration for existing databases - Spawn worker with -m model when model_override is set	2026-05-18 20:12:28 -07:00
Beandon13	bde6313e34	feat(kanban): archive --rm to hard-delete archived tasks Salvages #19964 by @Beandon13. Adds `hermes kanban archive --rm` to permanently remove already-archived tasks with cascading cleanup of links, comments, events, runs, and notify-subs. Safety guard: only archived tasks can be deleted; active/blocked/done must be archived first. Cherry-picked from #19964 onto current main (severe stale base, applied manually to preserve substance only).	2026-05-18 20:09:26 -07:00
zccyman	2e09d2567c	feat(kanban): add auto_promote_children config toggle When the kanban auto-decomposer fans a triage task into child tasks, recompute_ready() immediately promotes parent-free children to 'ready' so the dispatcher picks them up. Some users want a manual workflow where children stay in 'todo' for review before dispatch. Add 'kanban.auto_promote_children' config key (default: true): - false: children stay in 'todo' after decomposition - true: existing behavior (auto-promote to 'ready') Changes: - kanban_db.py: decompose_triage_task() gains auto_promote param - kanban_decompose.py: reads auto_promote_children from config - kanban dashboard API: exposes the new setting in GET/PUT /orchestration Closes #28016	2026-05-18 20:04:32 -07:00
EloquentBrush0x	502d03d5a3	fix(kanban): detect cycles in decompose_triage_task sibling-link pre-validation decompose_triage_task inlines SQL INSERTs for atomicity and intentionally bypasses link_tasks() — which calls _would_cycle() per edge. If the LLM emits a cyclic parent graph (e.g. A.parents=[1], B.parents=[0]) the DB write succeeds but every involved child deadlocks in 'todo' forever: recompute_ready() requires all parents to be done, which is impossible when A waits for B and B waits for A. Add a Kahn topological sort over the sibling parent indices in the pre-validation block, before any DB writes. Mirrors the cycle-safety guarantee that link_tasks() provides for manually linked tasks.	2026-05-18 09:40:44 -07:00
Teknium	f2fdb9a178	feat(gateway): deliverable mode — ship artifacts as native uploads from any agent surface (#27813) The agent can now produce a chart, PDF, spreadsheet, or any other supported file type and have it land in Slack / Discord / Telegram / WhatsApp / etc. as a native attachment, just by mentioning the absolute path in its response. Same primitive works for kanban-worker completions: workers attach artifacts via kanban_complete(artifacts=[...]) and the gateway notifier uploads them alongside the completion message. Changes: - gateway/platforms/base.py: extract_local_files now covers PDFs, docx, spreadsheets (xlsx/csv/json/yaml), presentations (pptx), archives (zip/tar/gz), audio (mp3/wav/...), and html — not just images and video. Image/video extensions still embed inline; everything else routes to send_document via the existing dispatch partition in gateway/run.py. - tools/kanban_tools.py + hermes_cli/kanban_db.py: kanban_complete gains an explicit ``artifacts`` parameter. The handler stashes it in metadata.artifacts (for downstream workers) and the kernel promotes it onto the completed-event payload so the notifier can find it without a second SQL round-trip. - gateway/run.py: _kanban_notifier_watcher now calls a new helper _deliver_kanban_artifacts after sending the completion text. The helper reads payload.artifacts (preferred), falls back to scanning the payload summary and task.result with extract_local_files, then partitions images / videos / documents and uploads each via send_multiple_images / send_video / send_document. - website/docs/user-guide/features/deliverable-mode.md + sidebars.ts: user-facing docs page covering the extension list, the kanban artifacts pattern, and the MCP-for-connector-breadth recommendation. Tests: - tests/gateway/test_extract_local_files.py: 7 new test cases (documents, spreadsheets, presentations, audio, archives, html, chart-pdf canonical case). 44 passing, 0 regressions.
- tests/tools/test_kanban_tools.py: 4 new cases covering the artifacts arg shape (list / string / merge with existing metadata / type rejection). 17 passing. - tests/hermes_cli/test_kanban_notify.py: 2 new cases covering full notifier → artifact-upload path and missing-file silent-skip. 12 passing. - E2E (real files, real kanban kernel, real BasePlatformAdapter): worker calls kanban_complete(artifacts=[png,pdf,csv]) → metadata + event payload land → notifier helper partitions correctly → send_multiple_images called once with the PNG, send_document called twice with PDF + CSV. What's NOT in this PR (deferred to follow-ups): - Ad-hoc "research this for two hours, ping the thread when done" slash command — covered today by kanban subscriptions; a dedicated slash command can ride a follow-up PR if needed. - Setup-wizard prompt for recommended MCP servers (Notion, GitHub, Linear, etc.) — docs page lists them; UI is a separate change. Plan and rationale captured in ~/.hermes/docs/perplexity-computer-parity.pdf (local doc, not shipped).	2026-05-18 02:14:43 -07:00
qWaitCrypto	6e60a8a092	feat(kanban): make worker log retention configurable	2026-05-18 01:21:41 -07:00
qWaitCrypto	8831eb5c70	fix(kanban): align worker terminal timeout with task runtime	2026-05-18 01:20:52 -07:00
Teknium	1345dda0cf	feat(kanban): orchestrator-driven auto-decomposition on triage (#27572) * feat(kanban): orchestrator-driven auto-decomposition on triage Closes the core gap in the kanban system: dropping a one-liner into Triage now decomposes it into a graph of child tasks routed to specialist profiles by description, matching teknium's original vision ("main orchestrator splits/creates actual tasks, doles them out to each agent"). The build --------- - hermes_cli/profiles.py: new `description` + `description_auto` fields on ProfileInfo, persisted in <profile_dir>/profile.yaml. Helpers read_profile_meta / write_profile_meta. `create_profile` accepts optional description. - hermes_cli/profile_describer.py: new module — auto-generate a 1-2 sentence description from a profile's skills + model + name via the auxiliary LLM (`auxiliary.profile_describer`). - hermes_cli/main.py: new `hermes profile create --description ...` flag; new `hermes profile describe [name] [--text ... | --auto | --all --auto]` subcommand. - hermes_cli/kanban_db.py: new `decompose_triage_task` atomic helper — creates N child tasks, links the root as a child of every leaf (root waits for the whole graph), flips root `triage -> todo` with orchestrator assignee, records an audit comment + `decomposed` event in a single write_txn. - hermes_cli/kanban_decompose.py: new module — calls the auxiliary LLM (`auxiliary.kanban_decomposer`) with the profile roster + descriptions to produce a JSON task graph, then invokes the DB helper. Rewrites unknown assignees to the configured `kanban.default_assignee` (or the active default profile) so a task NEVER lands with assignee=None. Falls back to specify-style single-task promotion when the LLM returns `fanout: false`. - hermes_cli/kanban.py: new `hermes kanban decompose [task_id | --all]` CLI verb.
- hermes_cli/config.py: new DEFAULT_CONFIG keys — kanban.orchestrator_profile, kanban.default_assignee, kanban.auto_decompose (default True), kanban.auto_decompose_per_tick (default 3), auxiliary.kanban_decomposer, auxiliary.profile_describer. - gateway/run.py: kanban dispatcher watcher now runs auto-decompose before each `_tick_once`, capped by `auto_decompose_per_tick` so a bulk-load of triage tasks doesn't burst-spend the aux LLM. - plugins/kanban/dashboard/plugin_api.py: new endpoints — GET /profiles (list roster + descriptions), PATCH /profiles/<name> (set description, user-authored), POST /profiles/<name>/describe-auto (LLM-generate), POST /tasks/<id>/decompose (run decomposer), GET/PUT /orchestration (orchestrator/default-assignee/auto-decompose pickers, with resolved fallbacks echoed back). - plugins/kanban/dashboard/dist/index.js: new OrchestrationPanel collapsible — dropdowns for orchestrator profile and default assignee, auto-decompose toggle, per-profile description editor with Save and Auto-generate buttons. New ⚗ Decompose button next to ✨ Specify on triage-column task drawers. Behavior -------- - A task in Triage gets fanned out into a small DAG of child tasks. Children with no internal parents flip to `ready` immediately (parallel dispatch). Children with sibling parents wait. The root stays alive as a parent of every child — when the whole graph finishes, it promotes to `ready` and the orchestrator profile wakes back up to judge completion (the "adds more tasks until done" part of the original vision). - `kanban.orchestrator_profile` unset -> falls back to the default profile (whichever `hermes` launches with no -p flag). - `kanban.default_assignee` unset -> same fallback. Tasks NEVER end up unassigned. - `kanban.auto_decompose=true` (default) runs the decomposer automatically on dispatcher ticks; manual `hermes kanban decompose` is always available. 
Tests ----- - tests/hermes_cli/test_kanban_decompose_db.py — 7 tests for the atomic DB helper (status transitions, dep graph, audit trail, validation errors). - tests/hermes_cli/test_kanban_decompose.py — 6 tests for the decomposer module (fanout, no-fanout fallback, unknown-assignee rewrite, malformed-JSON resilience, no-aux-client path). - tests/hermes_cli/test_profile_describer.py — 10 tests for profile.yaml r/w + the LLM auto-describer (yaml corrupt tolerance, user-vs-auto description protection, --overwrite, fallback parsing). E2E --- - CLI end-to-end: created profiles with descriptions, dropped a triage task, mocked the aux LLM with a 3-task graph -> verified all three children were created with the right assignees, the dependency edges matched the LLM's graph, root flipped to todo gated by every child, audit comment + `decomposed` event recorded. - Dashboard end-to-end: started the dashboard against an isolated HERMES_HOME, verified all four new endpoints via curl (profile listing, PATCH for description, PUT for orchestration settings, POST for decompose). Opened the UI in the browser, confirmed the OrchestrationPanel renders with all three pickers + the per-profile description editor, typed a description, clicked Save, verified ~/.hermes/profile.yaml was written. Clicked Decompose on the triage card and confirmed the inline error message surfaced as designed ("no auxiliary client configured"). * feat(kanban): surface decompose mode (Auto/Manual) as a one-click pill The auto/manual toggle already existed as kanban.auto_decompose (default true), but it was buried inside the collapsed Orchestration settings panel — users couldn't tell at a glance which mode they were in. This hoists it to a pill at the top of the kanban page so the state is always visible and one click flips it. UX - New "⚗ Decompose: AUTO|MANUAL" pill in the kanban header. Emerald styling when Auto is on (the default), muted/gray when Manual.
- Pill is visible both in the collapsed AND expanded Orchestration settings views so context is preserved when the user opens the panel. - Tooltip explains both states + what clicking does. - Renamed the in-panel "Auto-decompose on triage / Enabled" checkbox to "Decompose mode / Auto (default) | Manual" for language parity with the pill. Behavior preserved - Default remains Auto (kanban.auto_decompose=true). - Manual mode restores pre-PR behavior: triage tasks stay in triage until the user clicks ⚗ Decompose on each card (or runs `hermes kanban decompose <id>`). Implementation - plugins/kanban/dashboard/dist/index.js: load /orchestration on mount (not just on expand) so the collapsed pill reflects real state. Render mode pill in both collapsed and expanded headers. Reuses the existing PUT /api/plugins/kanban/orchestration endpoint — no new backend, no new tests required. E2E verified - Pill renders as "⚗ Decompose: AUTO" on page load (default). - One click flips to "⚗ Decompose: MANUAL" with muted styling. - config.yaml on disk shows auto_decompose: false after the flip. - Second click round-trips back to Auto; config.yaml flips to true. * feat(kanban): rename mode pill to "Orchestration: Auto/Manual" Per Teknium feedback — "Decompose" was too implementation-specific. "Orchestration" is the user-facing concept (the whole pitch is the orchestrator profile routing work), and the pill is the front door to it. - Pill text: "Orchestration: Auto" / "Orchestration: Manual" (title case, no ⚗ prefix, no SHOUTY-CAPS for the mode value) - In-panel checkbox label: "Orchestration mode" (was "Decompose mode") - Tooltips updated to match - No behavior change * docs(kanban): document decompose, profile descriptions, orchestration mode Brings the docs site up to parity with the PR. English build verified locally (npx docusaurus build --locale en) — clean, no new broken links or anchors.
Pre-existing broken-link warnings (rl-training, llms.txt, step-by-step-checklist, fallback-model) untouched. - website/docs/reference/cli-commands.md + `hermes kanban decompose` action row in the action table, with pointer to the Auto vs Manual orchestration section. - website/docs/reference/profile-commands.md + `--description "<text>"` flag on `hermes profile create`. + Full `hermes profile describe` section: read, --text, --auto, --overwrite, --all flags with examples. - website/docs/user-guide/features/kanban.md (the big one) + Triage column intro rewritten around the Auto-decompose default behavior, with pointer to the new Auto vs Manual section. + Status action row updated to mention both ⚗ Decompose and ✨ Specify on triage cards. + New "Auto vs Manual orchestration" section explaining the two modes, how to flip them (pill, config), how routing-by-description works, the no-None-assignee guarantee, plus a config knob table (auto_decompose, auto_decompose_per_tick, orchestrator_profile, default_assignee) and the two new auxiliary slots (kanban_decomposer, profile_describer). + REST surface table gains 6 new endpoint rows: /tasks/:id/decompose, /profiles (GET), /profiles/:name (PATCH), /profiles/:name/describe-auto, /orchestration (GET + PUT). - website/docs/user-guide/features/kanban-tutorial.md + Triage column blurb updated for Auto by default + Manual via the pill, with cross-link to the Auto vs Manual orchestration section. - website/docs/user-guide/profiles.md + Blank-profile flow now mentions --description and points to the kanban routing model for context. - website/docs/user-guide/configuration.md + `kanban_decomposer` and `profile_describer` added to the `hermes model -> Configure auxiliary models` menu listing.	2026-05-17 13:54:12 -07:00
Grogger	8bf09455dc	fix(windows): suppress console window flash on subprocess spawns Add creationflags=CREATE_NO_WINDOW to every Windows Popen call across the terminal, process registry, code execution, and kanban worker subsystems. Prevents visible CMD windows from flashing on the user's desktop during agent operation. Also adds the _IS_WINDOWS module constant to kanban_db.py where it was missing, for consistency with the other patched files. 5 Popen sites across 4 files: - tools/environments/local.py (terminal foreground spawn) - tools/process_registry.py (background process spawn) - tools/code_execution_tool.py (sandbox + interpreter probe) - hermes_cli/kanban_db.py (kanban worker spawn)	2026-05-16 23:05:27 -07:00
kshitij	2ec8d2b42f	chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) Replace tuple literals with set literals for all literal-tuple membership tests. Set lookup is O(1) vs O(n) for tuple — consistent micro-optimization across the codebase. 608 instances fixed via `ruff --fix --unsafe-fixes`, 0 remaining. 133 files, +626/-626 (net zero).	2026-05-11 11:13:25 -07:00
kshitijk4poor	5712483487	fix: guard resolve_profile_env against missing profile dirs The _default_spawn HERMES_HOME injection (PR #23356) calls resolve_profile_env which raises FileNotFoundError when the profile dir doesn't exist. In production the profile always exists (workers are only dispatched for live profiles), but tests with isolated HERMES_HOME never create profile dirs. Catch FileNotFoundError and fall through — HERMES_PROFILE is still set below, so the worker CLI resolves the profile at startup.	2026-05-11 06:44:58 -07:00
Ninso112	a1854ac07c	fix(kanban): treat archived parent tasks as terminal for dependency resolution When a parent task is archived, dependent child tasks were stuck in todo forever because recompute_ready and claim_task only checked for status == 'done'. Now both functions also treat 'archived' as a terminal status, allowing children to proceed when their parent is archived. Fixes #23180.	2026-05-11 06:44:58 -07:00
TurgutKural	5af315c4cc	fix(kanban): inject HERMES_HOME into worker subprocess env Default spawn did not propagate HERMES_HOME when forking kanban workers. The worker's env is copied from the parent via dict(os.environ), so HERMES_HOME is absent. When the child then starts hermes -p <profile>, the CLI's _apply_profile_override() runs before hermes_constants is imported and get_hermes_home() falls back to ~/.hermes (the default profile root), silently ignoring the profile's config.yaml. Profile-scoped fallback_providers, toolsets, and agent settings are therefore never applied to kanban workers. The fix injects HERMES_HOME into the worker's env using resolve_profile_env(profile_arg) so the child reads the correct profile directory instead of the default root.	2026-05-11 06:44:58 -07:00
Mike Nguyen	ba5640fa11	fix(gateway): route kanban notifications to creator profile	2026-05-10 20:04:53 -07:00
konsisumer	88588b6159	fix(kanban): extend stale claim instead of killing live worker Workers running slow models (e.g. kimi-k2.6) can spend longer than DEFAULT_CLAIM_TTL_SECONDS inside a single tool-free LLM call, making no tool calls and therefore not heartbeating. release_stale_claims previously reclaimed these healthy workers, producing the spawn-then-immediately-reclaim loop reported in #23025. When a stale-by-TTL claim's host-local worker PID is still alive, extend the claim (emit a claim_extended event) rather than killing it. enforce_max_runtime / detect_crashed_workers remain the upper bounds for genuinely wedged or dead workers. Reclaim events now also record claim_expires, last_heartbeat_at, worker_pid, and host_local so operators can see why a worker was killed.	2026-05-10 15:23:04 -07:00
Mike Nguyen	861ce7c0b6	fix: dedupe kanban notifier delivery claims	2026-05-10 13:19:41 -07:00
Teknium	3fbbf58853	docs(kanban): document max_spawn as live concurrency cap (not per-tick budget) Follow-up to the previous commit's behavior fix. Adds a paragraph to dispatch_once's docstring making the concurrency-cap semantic explicit, and an inline comment near the running_count query explaining why we do the count (so a future reader doesn't refactor it back to per-tick semantics thinking it's redundant). Both call out the unbounded-accumulation failure mode that motivated the fix, since nothing in the codebase or skills currently documents what max_spawn is supposed to mean. The semantic is per-board: each kanban board has its own SQLite file, so the running-count COUNT(*) is naturally scoped to the board the dispatcher tick is processing.	2026-05-10 09:13:07 -07:00
guglielmofonda	845be254ec	fix(kanban): cap dispatch by running workers	2026-05-10 09:13:07 -07:00
Teknium	1f5983c4c8	feat(kanban): aggregate all toolset-name typos in skills before raising Follow-up to the previous commit's toolset-vs-skill validation. The contributor's fix raises ValueError on the first toolset name found in the skills list. That works for one mistake, but agents that confuse skills with toolsets usually pass several at once (`skills=["web", "browser", "terminal"]`) — and serial-correcting one per failure round-trip wastes tokens. Collect all toolset-shaped entries first, then raise once with the full list. The error message is also slightly clearer: 'web', 'browser', 'terminal' are toolset names, not skill name(s). Put toolsets in the assignee profile's `toolsets:` config instead of per-task skills. Skills are named skill bundles (e.g. `kanban-worker`, `blogwatcher`); toolsets are runtime capabilities (e.g. `web`, `browser`, `terminal`). vs. the previous "the assignee profile's toolsets" — explicitly naming the YAML key (`toolsets:`) and giving concrete examples in both categories closes the conceptual gap that produced the bug to begin with. Adds one regression test (test_create_task_skills_lists_all_toolset_typos) covering the multi-name aggregation path. The single-typo test from the original PR still passes (the loose `match="toolset name"` matches both singular and plural forms).	2026-05-10 08:41:28 -07:00
LeonSGP43	673418dfa1	fix(kanban): reject toolset names in task skills	2026-05-10 08:41:28 -07:00
baocin	061a183008	fix(kanban): guard task_age against corrupt created_at values like '%s' task_age() crashed with ValueError when created_at contained the literal format string '%s' instead of a Unix timestamp, taking down the entire GET /board endpoint with a 500. - Add _safe_int() helper that returns None on non-numeric values - Refactor task_age() to use _safe_int instead of bare int() casts - Wrap task_age() call in _task_dict with try/except fallback so one corrupt row never kills the whole board endpoint	2026-05-10 07:15:59 -07:00
Teknium	62b1c74cbc	fix(kanban): correct dispatcher spawn module name + PATH-first lookup Follow-up to the previous commit's contributor cherry-pick. The cherry-picked change replaced the bare ``["hermes", ...]`` spawn with ``[sys.executable, "-m", "hermes", ...]``. The intent was right (avoid PATH dependence — cron, systemd User= services, launchd jobs, and other detached dispatcher invocations routinely run with a stripped $PATH that doesn't include the venv's bin/, breaking the bare-shim spawn) but the module name is wrong: there is no top-level ``hermes`` package. The console-script entry point in pyproject.toml is ``hermes = "hermes_cli.main:main"``, and ``python -m hermes`` fails with ``No module named hermes``. The cherry-picked form would have replaced a sometimes-broken spawn with an always-broken one. This commit: - Adds ``_resolve_hermes_argv()``, mirroring ``gateway.run._resolve_hermes_bin``. Tries ``shutil.which("hermes")`` first (preferred — keeps existing ``ps`` output and log lines familiar in the common case) and falls back to ``[sys.executable, "-m", "hermes_cli.main"]`` when the shim is not on PATH. The fallback goes through the running interpreter so it's PATH-independent. Kept as a local helper rather than imported from gateway because ``hermes_cli`` sits below ``gateway`` in the dependency order. - Switches the dispatcher's ``cmd`` list to use ``_resolve_hermes_argv()``. - Adds three regression tests: ``test_resolve_hermes_argv_prefers_path_shim`` — pins the PATH-first branch so a future refactor doesn't silently flip the order. * ``test_resolve_hermes_argv_falls_back_to_module_form_when_no_path_shim`` — pins the correct module name (``hermes_cli.main``, NOT ``hermes``). Direct regression guard for the form that shipped in the original PR. 
* ``test_resolve_hermes_argv_module_actually_runs`` — runs the fallback invocation as a real subprocess and asserts ``--version`` works, so losing ``hermes_cli.main``'s ``__main__`` handling can't slip past the string-match test. Verified end-to-end: with the shim on PATH the resolver returns ``[/.../hermes]`` and ``--version`` works; with the shim removed the resolver returns ``[python, -m, hermes_cli.main]`` and ``--version`` still works; the original PR's ``python -m hermes`` invocation fails as expected (``No module named hermes``).	2026-05-10 07:10:47 -07:00
Wali Reheman	d3db6724dd	fix(kanban): use sys.executable -m hermes for dispatcher spawn In NixOS container mode, hermes is installed at a store path with no symlink on PATH (e.g. /data/current-package/bin/hermes). The kanban dispatcher spawns workers via _default_spawn() using a bare 'hermes' subprocess call, which fails with 'hermes executable not found on PATH' in container mode. Fix by calling sys.executable -m hermes instead, which is guaranteed to resolve to the same Python interpreter running the dispatcher.	2026-05-10 07:10:47 -07:00
Wesley Simplicio	78698381af	fix(kanban): make _migrate_add_optional_columns idempotent on concurrent open ALTER TABLE calls inside _migrate_add_optional_columns were guarded by a snapshot of PRAGMA table_info taken at function entry. When the gateway dispatcher opens the kanban DB twice per tick (once in _tick_once_for_board and once via init_db's discard-and-reconnect path), a second connection can run the same migration before the first one commits, causing: sqlite3.OperationalError: duplicate column name: consecutive_failures This crashed the dispatcher on every first tick after a gateway restart (subsequent ticks succeeded because the columns were then present). Fix: introduce _add_column_if_missing() which wraps ALTER TABLE in a try/except that swallows OperationalError whose message contains 'duplicate column name'. All ALTER TABLE calls in _migrate_add_optional_columns are routed through this helper. Closes #21708	2026-05-09 13:36:23 -07:00
Teknium	ade5981429	fix(kanban): sanitize comment author rendering in build_worker_context (#22769) Operator-controlled HERMES_PROFILE values were rendered as '${author} (${ts}):' (markdown bold with no provenance prefix). Worker comment bodies render directly underneath. A misleading profile name like 'hermes-system' or 'operator' could be misread by the next worker as a system directive above attacker-influenced content (confused-deputy primitive gated on operator misconfig). The LLM-controlled author-forgery surface was already closed in #22435 (author removed from KANBAN_COMMENT_SCHEMA). This is defense-in-depth: render with an explicit 'comment from worker `<author>` at <ts>:' prefix so even 'hermes-system' resolves to 'comment from worker `hermes-system` at ...', which is parseable as worker-comment metadata, not a system directive. Strip backticks from author so they can't break out of the fence. Update test_build_worker_context_caps_comments to count by body regex since the rendered author line now also starts with 'comment '. Closes #22452.	2026-05-09 12:47:58 -07:00
Matthew Cater	cda20eec0c	fix(kanban): gate claim + unblock on parent completion Enforce the parent-completion invariant at claim_task (the single ready->running chokepoint) and re-gate unblock_task so blocked->ready only fires when parents are done. Prevents child tasks from running ahead of in-progress parents under the create-then-link race. Also adds a stress test that races concurrent create+link against hammered claim_task and asserts no child runs while any parent is undone. Ref: kanban/boards/cookai/workspaces/t_a6acd07d/root-cause.md Refs: t_8d6af9d6	2026-05-09 11:07:37 -07:00
Wesley Simplicio	0c22434f03	fix(kanban): call recompute_ready after unlink_tasks removes a dependency Problem: unlink_tasks() removes a parent→child dependency edge but does not trigger recompute_ready(). A child whose last blocking parent is unlinked stays stuck in 'todo' indefinitely — it only promotes to 'ready' on the next dispatcher tick or a manual 'hermes kanban recompute'. For CLI-only users without a dispatcher, the child is permanently stuck. Root cause: complete_task() and unblock_task() both call recompute_ready() after their write transaction so downstream children are evaluated immediately. unlink_tasks() was missing this call — removing a dependency is semantically equivalent to completing one, so the same recompute is needed. Fix: Capture the rowcount result before the write_txn exits, then call recompute_ready(conn) outside the transaction when a row was actually deleted (so the child sees the updated task_links state). Tests: Added test_unlink_tasks_triggers_recompute_ready in tests/hermes_cli/test_kanban_db.py: creates parent A (done) + parent C (running), child B with both parents (todo), unlinks C→B, asserts B is ready immediately. Stash-verified: FAILS without fix (child stays todo), PASSES with fix. 62/62 tests green in tests/hermes_cli/test_kanban_db.py. Closes #22459.	2026-05-09 11:06:21 -07:00
kshitij	2a7047c2ed	fix(sqlite): fall back to journal_mode=DELETE on NFS/SMB/FUSE (#22043) SQLite's WAL mode requires shared-memory (mmap) coordination and fcntl byte-range locks that don't reliably work on network filesystems. Upstream documents this explicitly: https://www.sqlite.org/wal.html#sometimes_queries_return_sqlite_busy_in_wal_mode On NFS / SMB / some FUSE mounts / WSL1, 'PRAGMA journal_mode=WAL' raises 'sqlite3.OperationalError: locking protocol' (SQLITE_PROTOCOL). Before this change, every feature backed by state.db or kanban.db broke silently: - /resume, /title, /history, /branch returned 'Session database not available.' with no cause - gateway logged the init failure at DEBUG (invisible in errors.log) - kanban dispatcher crashed every 60s, driving the known migration race (duplicate column name: consecutive_failures, #21708 / #21374) Changes: - hermes_state.apply_wal_with_fallback(): shared helper that tries WAL and falls back to DELETE on SQLITE_PROTOCOL-style errors with one WARNING explaining why - hermes_state.get_last_init_error() + format_session_db_unavailable(): capture the init failure cause and surface it in user-facing strings (with an NFS/SMB pointer for 'locking protocol') - hermes_cli/kanban_db.connect(): use the shared helper - gateway/run.py: bump SessionDB init failure log DEBUG -> WARNING (matches cli.py's existing correct behavior) - cli.py (4 sites) + gateway/run.py (5 sites): replace bare 'Session database not available.' with format_session_db_unavailable() Tests: 12 new tests in tests/test_hermes_state_wal_fallback.py + 1 new test in tests/hermes_cli/test_kanban_db.py. Existing suites (state, kanban, gateway, cli) remain green for all tests unrelated to pre-existing failures on main. Evidence: real-world user on NFSv3 mount (172.26.224.200:d2dfac12/home, local_lock=none) reporting 'Session database not available.' on /resume; 'locking protocol' appears in 4 distinct log entries across backup, kanban, TUI, and CLI paths in the same session. closes #22032	2026-05-09 02:09:35 -07:00
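
The WAL fallback described in commit 2a7047c2ed can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in hermes_state: the function name ``apply_journal_mode_with_fallback``, the ``_FALLBACK_MARKERS`` tuple, and the decision to match only on the 'locking protocol' message are hypothetical, and the real ``apply_wal_with_fallback()`` may use a different signature and different matching rules.

```python
import logging
import sqlite3

logger = logging.getLogger(__name__)

# Assumption: only the 'locking protocol' message quoted in the commit
# triggers the fallback. The real helper may match more error strings.
_FALLBACK_MARKERS = ("locking protocol",)


def apply_journal_mode_with_fallback(conn):
    """Try WAL first; fall back to DELETE when the filesystem rejects WAL.

    Returns the journal mode reported by SQLite, lowercased.
    Hypothetical helper, named differently from the real one on purpose.
    """
    try:
        row = conn.execute("PRAGMA journal_mode=WAL").fetchone()
        return str(row[0]).lower()
    except sqlite3.OperationalError as exc:
        message = str(exc).lower()
        if not any(marker in message for marker in _FALLBACK_MARKERS):
            # Unrelated failure (corrupt file, permissions, ...): do not mask it.
            raise
        # One WARNING that explains why WAL was skipped.
        logger.warning(
            "WAL unavailable (%s); falling back to journal_mode=DELETE. "
            "This is common on NFS/SMB/FUSE mounts.",
            exc,
        )
        row = conn.execute("PRAGMA journal_mode=DELETE").fetchone()
        return str(row[0]).lower()
```

A caller would run this once, right after ``sqlite3.connect(path)``, and carry on with whichever mode comes back. Errors that don't match still propagate, so a genuinely broken database still fails loudly.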

82 commits
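
The idempotent migration described in commit 78698381af (wrapping ``ALTER TABLE`` so a concurrent duplicate is tolerated) might look roughly like the sketch below. The function name ``add_column_if_missing``, its parameters, and the boolean return value are assumptions for illustration; the commit message only states that the real ``_add_column_if_missing()`` swallows ``OperationalError`` whose message contains 'duplicate column name'.

```python
import sqlite3


def add_column_if_missing(conn, table, column, decl):
    """Add a column, tolerating a racing connection that added it first.

    Returns True if this call added the column, False if it already existed.
    Hypothetical signature; table/column/decl must be trusted identifiers,
    because SQLite cannot bind them as query parameters.
    """
    try:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
        return True
    except sqlite3.OperationalError as exc:
        # SQLite reports 'duplicate column name: <column>' when it exists.
        if "duplicate column name" in str(exc).lower():
            return False
        # Anything else (missing table, syntax error) is a real failure.
        raise
```

Unlike a ``PRAGMA table_info`` snapshot taken at function entry, this check happens at the moment of the write. A second connection that loses the race gets ``False`` back, not a crash.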