hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-06-23 10:42:00 +00:00

Teknium 1fc8733a69 fix(kanban): unify failure counter across spawn/timeout/crash outcomes (#20410)

The dispatcher's circuit breaker only protected against spawn-side failures (profile missing, workspace mount error, exec failure). Workers that successfully spawned but then timed out or crashed re-queued to ``ready`` with no counter increment, so the next tick re-spawned them, looping forever until someone noticed.

Reported externally on Twitter (Forbidden Seeds) and confirmed by walking the kernel: ``enforce_max_runtime`` flipped the task back to ready, emitted a ``timed_out`` event, and never touched ``spawn_failures``; same for ``detect_crashed_workers``.

Fix: unify the counter across all non-success outcomes.

Schema
------
* ``tasks.spawn_failures`` → ``tasks.consecutive_failures``
* ``tasks.last_spawn_error`` → ``tasks.last_failure_error``
* Migration renames the columns in-place on existing DBs (``ALTER TABLE RENAME COLUMN``, SQLite >= 3.25) so historical counter values are preserved. Row mappers fall through to the legacy names if both column renames and a migration somehow got out of sync.

Counter lifecycle
-----------------
New helper ``_record_task_failure(conn, task_id, error, *, outcome, release_claim, end_run, event_payload_extra)`` is the single point every non-success outcome funnels through:

* ``spawn_failed`` → ``_record_spawn_failure`` (kept as alias) calls it with ``release_claim=True, end_run=True``: transitions running→ready, clears claim, closes run.
* ``timed_out`` → ``enforce_max_runtime`` already does the status transition + run close + event emission, then calls ``_record_task_failure`` with ``release_claim=False, end_run=False`` just to bump the counter (and trip the breaker if needed).
* ``crashed`` → ``detect_crashed_workers`` same pattern, but the counter increment runs after the main write_txn closes (SQLite doesn't nest write transactions).

If the counter hits the breaker threshold (``DEFAULT_FAILURE_LIMIT=5``, same as before), the task transitions to ``blocked`` with a ``gave_up`` event on top of whatever outcome-specific event was already emitted.

Reset semantics changed: the counter now clears only on successful ``complete_task`` (and operator ``reclaim_task``, an explicit "I've looked at this, try again with a fresh budget"). Previously ``_clear_spawn_failures`` ran on every successful spawn, which would have wiped the counter before a timeout could accumulate past threshold, which is exactly the loop this fix prevents.

Diagnostics
-----------
* ``_rule_repeated_spawn_failures`` → ``_rule_repeated_failures``. Now fires regardless of which outcome is at fault. Classifies the most recent failure (spawn_failed / timed_out / crashed) from the run history so the title ("Agent timeout x3", "Agent crash x4", "Agent spawn x5") and suggested action (``doctor`` for spawn, ``log`` for timeout/crash) stay outcome-specific without N duplicate rules.
* ``_rule_repeated_crashes`` kept as a narrower early-warning at threshold 2 (vs 3 for the unified rule), but now suppresses itself when the unified rule would also fire, which avoids double-flagging.
* Diagnostic ``data`` payload now carries ``{consecutive_failures, most_recent_outcome, last_error}`` instead of spawn-specific keys.

CLI
---
* ``Task.consecutive_failures`` / ``Task.last_failure_error`` are the public fields now. Existing callers that referenced the old names get migrated (tests updated in this commit).
* Backward-compat: ``DEFAULT_SPAWN_FAILURE_LIMIT``, ``_clear_spawn_failures``, ``_record_spawn_failure`` stay as aliases.

Tests
-----
* 6 new kernel tests: timeout increments counter, 3 consecutive timeouts trip the breaker (was the reported gap), crash increments counter, reclaim clears counter, completion clears counter, spawn success does NOT clear counter.
* Diagnostic tests: updated ``repeated_spawn_failures`` cases to use the new kind name and add a timeout-loop test.
* Dashboard API test: spawn_failures column update → consecutive_failures.

389/389 kanban-suite tests pass.

Live verification
-----------------
Seeded 4 tasks in an isolated HERMES_HOME: 3 timeouts, 4 crashes, 2-spawn-failed + 2-timed-out, and a task that had prior failures but completed successfully. Board correctly shows "!! 3 tasks need attention" (the successful one has no badge because the counter reset). Drawer for the timeout-loop task renders "Agent timeout x3" with most_recent_outcome=timed_out and the "Check logs" suggested action (not the spawn-flavoured "Verify profile"). The successful task has zero diagnostics.

Closes the Forbidden-Seeds-reported gap.

2026-05-05 13:55:37 -07:00
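The in-place column rename described under "Schema" could look roughly like the sketch below. It is not the repository's migration code: the function names `migrate_failure_columns` and `read_failure_count`, and the three-column `tasks` table, are assumptions made for illustration. Only the column names and the `ALTER TABLE RENAME COLUMN` requirement (SQLite >= 3.25) come from the commit message.

```python
import sqlite3

# Legacy -> new column names, as listed in the commit message.
RENAMES = {
    "spawn_failures": "consecutive_failures",
    "last_spawn_error": "last_failure_error",
}


def migrate_failure_columns(conn, table="tasks"):
    """Rename legacy columns in place; return the old names that were renamed.

    Hypothetical helper. Safe to call twice: already-renamed columns are skipped.
    """
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    renamed = []
    for old, new in RENAMES.items():
        if old in existing and new not in existing:
            # RENAME COLUMN needs SQLite >= 3.25 and keeps the stored values.
            conn.execute(f"ALTER TABLE {table} RENAME COLUMN {old} TO {new}")
            renamed.append(old)
    conn.commit()
    return renamed


def read_failure_count(row):
    """Row mapper that falls back to the legacy name if the rename has not run."""
    keys = row.keys()
    name = "consecutive_failures" if "consecutive_failures" in keys else "spawn_failures"
    return row[name]


# Demo on a throwaway in-memory database with an assumed legacy schema.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    "CREATE TABLE tasks (id TEXT PRIMARY KEY, "
    "spawn_failures INTEGER DEFAULT 0, last_spawn_error TEXT)"
)
conn.execute("INSERT INTO tasks VALUES ('t1', 4, 'profile missing')")
legacy_row = conn.execute("SELECT * FROM tasks").fetchone()
renamed = migrate_failure_columns(conn)
migrated_row = conn.execute("SELECT * FROM tasks").fetchone()
print(renamed, read_failure_count(legacy_row), read_failure_count(migrated_row))
```

Checking `PRAGMA table_info` before renaming keeps the migration idempotent, and the fallback mapper covers the out-of-sync case the commit message mentions.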
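The counter lifecycle above can be illustrated with a minimal sketch. It is deliberately simplified and is not the project's `_record_task_failure`: it omits claims, runs, `release_claim`, `end_run`, and event storage; the table layout, the `done` status, and the returned event list are assumptions. Only the limit of 5, the `blocked` status, the `gave_up` event, and the reset-on-completion rule are taken from the commit message.

```python
import sqlite3

DEFAULT_FAILURE_LIMIT = 5  # breaker threshold quoted in the commit message

# Assumed, stripped-down schema for the sketch.
SCHEMA = """
CREATE TABLE tasks (
    id TEXT PRIMARY KEY,
    status TEXT NOT NULL DEFAULT 'ready',
    consecutive_failures INTEGER NOT NULL DEFAULT 0,
    last_failure_error TEXT
)
"""


def record_task_failure(conn, task_id, error, *, outcome, limit=DEFAULT_FAILURE_LIMIT):
    """Bump the counter for any non-success outcome and block the task at the limit.

    Returns the event names a caller would emit (hypothetical shape).
    """
    conn.execute(
        "UPDATE tasks SET consecutive_failures = consecutive_failures + 1, "
        "last_failure_error = ? WHERE id = ?",
        (error, task_id),
    )
    (count,) = conn.execute(
        "SELECT consecutive_failures FROM tasks WHERE id = ?", (task_id,)
    ).fetchone()
    events = [outcome]
    if count >= limit:
        # Breaker trips: stop re-queueing and add gave_up on top of the outcome event.
        conn.execute("UPDATE tasks SET status = 'blocked' WHERE id = ?", (task_id,))
        events.append("gave_up")
    conn.commit()
    return events


def complete_task(conn, task_id):
    """Successful completion is what resets the counter, not a successful spawn."""
    conn.execute(
        "UPDATE tasks SET status = 'done', consecutive_failures = 0, "
        "last_failure_error = NULL WHERE id = ?",
        (task_id,),
    )
    conn.commit()


# Demo: five timeouts in a row trip the breaker on the fifth.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.execute("INSERT INTO tasks (id) VALUES ('t1')")
history = [
    record_task_failure(conn, "t1", "max runtime exceeded", outcome="timed_out")
    for _ in range(DEFAULT_FAILURE_LIMIT)
]
status = conn.execute("SELECT status FROM tasks WHERE id = 't1'").fetchone()[0]
print(history[-1], status)
```

Because every outcome goes through one function, a task that alternates between spawn failures, timeouts, and crashes still accumulates toward the same limit.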
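The outcome-specific titles and suggested actions from the "Diagnostics" section could be derived along these lines. This is a sketch under assumptions, not the actual `_rule_repeated_failures`: the function signature, the list-of-strings run history, and the returned dictionary shape are hypothetical. The threshold of 3, the title format, the `doctor`/`log` actions, and the `data` keys follow the commit message.

```python
FAILURE_OUTCOMES = ("spawn_failed", "timed_out", "crashed")
TITLE_LABELS = {"spawn_failed": "spawn", "timed_out": "timeout", "crashed": "crash"}
SUGGESTED_ACTIONS = {"spawn_failed": "doctor", "timed_out": "log", "crashed": "log"}
UNIFIED_THRESHOLD = 3  # threshold the commit message gives for the unified rule


def rule_repeated_failures(consecutive_failures, run_outcomes, last_error=None):
    """Return one diagnostic dict, or None if the rule does not fire.

    run_outcomes is an assumed oldest-to-newest list of run outcome strings.
    """
    if consecutive_failures < UNIFIED_THRESHOLD:
        return None
    # Classify by the most recent failing run so the title stays outcome-specific.
    most_recent = next(
        (o for o in reversed(run_outcomes) if o in FAILURE_OUTCOMES), None
    )
    if most_recent is None:
        return None
    return {
        "title": f"Agent {TITLE_LABELS[most_recent]} x{consecutive_failures}",
        "action": SUGGESTED_ACTIONS[most_recent],
        "data": {
            "consecutive_failures": consecutive_failures,
            "most_recent_outcome": most_recent,
            "last_error": last_error,
        },
    }


diag = rule_repeated_failures(3, ["spawn_failed", "timed_out", "timed_out"], "max runtime exceeded")
print(diag["title"], diag["action"])
```

A single rule keyed on the latest outcome avoids maintaining one near-identical rule per failure type.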
..
__init__.py	fix(windows): enforce UTF-8 stdout/stderr to prevent UnicodeEncodeError crash	2026-05-03 16:58:25 -07:00
_parser.py	refactor(cli): derive relaunch flag table from argparse introspection	2026-04-29 20:33:29 -07:00
auth.py	feat: provider modules — ProviderProfile ABC, 33 providers, fetch_models, transport single-path	2026-05-05 13:40:01 -07:00
auth_commands.py	feat(nous): persist Nous OAuth across profiles via shared token store (#19712)	2026-05-04 04:54:55 -07:00
azure_detect.py	chore: remove unused imports and dead locals (ruff F401, F841) (#17010)	2026-04-28 06:46:45 -07:00
backup.py	fix(backup): floor pre-update backup_keep to 1 so the new backup survives	2026-05-04 05:07:13 -07:00
banner.py	fix(banner): show correct update status on nix-built hermes (#17550)	2026-04-30 07:03:00 +05:30
browser_connect.py	fix(browser): address Copilot review on /browser connect	2026-04-28 22:11:10 -07:00
callbacks.py	fix: ESC cancels secret/sudo prompts, clearer skip messaging (#9902)	2026-04-14 16:11:37 -07:00
claw.py	fix(claw): handle missing dir in _scan_workspace_state	2026-05-05 06:08:14 -07:00
cli_output.py	refactor: remove dead code — 1,784 lines across 77 files (#9180)	2026-04-13 16:32:04 -07:00
clipboard.py	feat: fix img pasting in new ink plus newline after tools	2026-04-11 13:14:32 -05:00
codex_models.py	feat(codex): add gpt-5.5 and wire live model discovery into picker (#14720)	2026-04-23 13:32:43 -07:00
colors.py	feat: respect NO_COLOR env var and TERM=dumb (#4079)	2026-03-30 17:07:21 -07:00
commands.py	feat(telegram): /topic off + help + auth gate + screenshot debounce	2026-05-04 12:07:17 -07:00
completion.py	fix: preserve profile name completion in dynamic shell completion	2026-04-14 10:45:42 -07:00
config.py	docs(bedrock): fix IAM permissions, add quickstart entry, add fallback provider, fix deployment section	2026-05-05 13:41:14 -07:00
copilot_auth.py	fix(copilot): exchange raw GitHub token for Copilot API JWT	2026-04-24 05:09:08 -07:00
cron.py	feat(cron): add no_agent mode for script-only cron jobs (watchdog pattern) (#19709)	2026-05-04 12:31:01 -07:00
curator.py	feat(curator): add archive and prune subcommands (#20200)	2026-05-05 05:15:54 -07:00
curses_ui.py	fix: treat ctrl-c as curses cancel	2026-05-04 01:36:44 -07:00
debug.py	fix(debug): redact log content at upload time in hermes debug share	2026-05-03 11:42:20 -07:00
default_soul.py	fix: reset default SOUL.md to baseline identity text (#3159)	2026-03-26 01:34:27 -07:00
dingtalk_auth.py	chore: remove unused imports and dead locals (ruff F401, F841) (#17010)	2026-04-28 06:46:45 -07:00
doctor.py	feat: provider modules — ProviderProfile ABC, 33 providers, fetch_models, transport single-path	2026-05-05 13:40:01 -07:00
dump.py	refactor(env): use shared Hermes dotenv loader	2026-05-05 10:13:13 -07:00
env_loader.py	refactor: consolidate symlink-safe atomic replace into shared helper	2026-04-28 04:58:22 -07:00
fallback_cmd.py	feat(cli): add 'hermes fallback' command to manage fallback providers (#16052)	2026-04-26 06:19:04 -07:00
gateway.py	fix(gateway): handle planned service stops	2026-05-04 16:00:49 -07:00
goals.py	feat: /goal — persistent cross-turn goals (Ralph loop) (#18262)	2026-04-30 23:10:20 -07:00
hooks.py	chore: remove unused imports and dead locals (ruff F401, F841) (#17010)	2026-04-28 06:46:45 -07:00
kanban.py	feat(kanban): generic diagnostics engine for task distress signals (#20332)	2026-05-05 13:32:42 -07:00
kanban_db.py	fix(kanban): unify failure counter across spawn/timeout/crash outcomes (#20410)	2026-05-05 13:55:37 -07:00
kanban_diagnostics.py	fix(kanban): unify failure counter across spawn/timeout/crash outcomes (#20410)	2026-05-05 13:55:37 -07:00
logs.py	feat: component-separated logging with session context and filtering (#7991)	2026-04-11 17:23:36 -07:00
main.py	fix(tui): close slash parity gaps with CLI (#20339)	2026-05-05 15:42:39 -05:00
mcp_config.py	refactor(config): migrate remaining 33 cfg_get call sites (#17311)	2026-04-29 04:03:03 -07:00
memory_setup.py	fix(cli): decode .env as UTF-8 to avoid GBK crash on Windows	2026-05-02 01:40:31 -07:00
model_catalog.py	chore: remove unused imports and dead locals (ruff F401, F841) (#17010)	2026-04-28 06:46:45 -07:00
model_normalize.py	feat(minimax-oauth): full integration with peer OAuth providers	2026-04-29 09:53:42 -07:00
model_switch.py	feat(cli): add list_picker_providers for credential-filtered picker	2026-05-05 10:18:58 -07:00
models.py	feat: provider modules — ProviderProfile ABC, 33 providers, fetch_models, transport single-path	2026-05-05 13:40:01 -07:00
nous_subscription.py	fix(cli): coerce use_gateway config flags in tool routing	2026-04-26 19:02:55 -07:00
oneshot.py	fix(tui): honor launch toolsets (#17623)	2026-04-29 16:55:27 -07:00
pairing.py	fix(pairing): handle null user_name in pairing list display	2026-04-23 02:34:11 -07:00
platforms.py	feat: complete plugin platform parity — all 12 integration points	2026-04-29 21:56:51 -07:00
plugins.py	feat(providers): make all 33 providers pluggable under plugins/model-providers/	2026-05-05 13:40:01 -07:00
plugins_cmd.py	feat(dashboard): add Plugins page with enable/disable, auth status, install/remove	2026-04-30 20:29:37 -04:00
profiles.py	fix(profiles): keep validate_profile_name strict; callers normalize first	2026-05-04 04:44:37 -07:00
providers.py	fix: prevent bare 'custom' slug in model.provider (#17478)	2026-04-30 04:32:11 -07:00
pty_bridge.py	fix(pty): default TERM for resize probes	2026-05-04 02:38:54 -07:00
relaunch.py	remove relaunch_chat	2026-04-29 20:33:29 -07:00
runtime_provider.py	fix(fallback): let custom_providers shadow built-in aliases	2026-04-30 20:18:44 -07:00
setup.py	fix(cli): sanitize bracketed paste markers during setup	2026-05-05 06:12:42 -07:00
skills_config.py	refactor(config): migrate remaining 33 cfg_get call sites (#17311)	2026-04-29 04:03:03 -07:00
skills_hub.py	feat(skills): install skills from a direct HTTP(S) URL (#16323)	2026-04-26 20:57:10 -07:00
skin_engine.py	fix(tui): restore macOS copy behavior and theme polish (#17131)	2026-04-28 18:47:14 -05:00
slack_cli.py	fix(paths): route achievements plugin + profile-tui through HERMES_HOME	2026-04-30 23:21:54 -07:00
status.py	fix(status): add missing popular provider API keys to hermes status display	2026-05-04 05:14:13 -07:00
timeouts.py	refactor(timeouts): drop redundant ImportError in except clause	2026-04-26 20:48:20 -07:00
tips.py	docs: refresh stale platform/LOC/test counts; clarify gateway vs plugin platforms	2026-05-05 13:45:47 -07:00
tools_config.py	fix(cli): sync use_gateway in _reconfigure_provider for tts, browser, and web	2026-05-04 02:33:55 -07:00
uninstall.py	feat(uninstall): offer to remove named profiles when uninstalling from default	2026-04-18 19:18:13 -07:00
vercel_auth.py	feat: add Vercel Sandbox backend	2026-04-29 07:22:33 -07:00
voice.py	fix(tui): respect voice.record_key config (supersedes #19028, #19339) (#19835)	2026-05-04 15:49:28 -07:00
web_server.py	feat(docker): launch dashboard as side-process via HERMES_DASHBOARD=1	2026-05-04 15:37:27 +10:00
webhook.py	refactor(config): migrate remaining 33 cfg_get call sites (#17311)	2026-04-29 04:03:03 -07:00