mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-06-17 09:41:58 +00:00
`hermes cron pause`/`resume`/`remove` run in their own CLI process (CLI → cronjob tool → pause_job → update_job → save_jobs), entirely separate from the gateway process that also writes jobs.json (mark_job_run, advance_next_run, due-fast-forward in get_due_jobs). The only synchronization was a module-level `threading.Lock`, which serializes writers *within a single process* but does nothing across processes — and update_job/pause_job/remove_job/create_job did not even take it. The result is a classic lost update: a `cron pause` issued while the gateway is live loads jobs.json, sets enabled=False, and saves; concurrently the gateway loads the same file and saves back its run-bookkeeping, clobbering the pause. The CLI prints "Paused" (it succeeded against its own in-memory copy) but the job stays enabled and keeps firing, with no error surfaced. The scheduler's `.tick.lock` flock can't be reused for this — it is held for the entire tick, including multi-minute agent runs, so a CLI mutation would block for minutes. Add `_jobs_lock()`: a short-held cross-process advisory file lock (fcntl/msvcrt flock on `<hermes_home>/cron/.jobs.lock`) layered over the existing in-process lock, and wrap every load→modify→save critical section with it — create_job, update_job, remove_job, mark_job_run, advance_next_run, get_due_jobs, rewrite_skill_refs. The lock degrades to in-process-only if neither fcntl nor msvcrt is available, preserving prior behaviour. All critical sections are short (field edits, no agent execution), so contention resolves in milliseconds. Adds a regression test that proves the lock excludes a second process (an in-process threading.Lock cannot). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
||
|---|---|---|
| .. | ||
| scripts | ||
| __init__.py | ||
| blueprint_catalog.py | ||
| jobs.py | ||
| scheduler.py | ||
| suggestion_catalog.py | ||
| suggestions.py | ||