mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-06-24 10:52:21 +00:00
When a recurring job's execution time exceeded `interval + grace`, the
scheduler entered a perpetual "missed → fast-forward → skip" loop and the
job effectively never ran again. A real job (`hermes-upstream-contribution`)
logged 42 consecutive "missed" events over 9 hours without executing once.

Timeline (5-min interval, 150s grace, ~15-min execution):
  14:00  due → advance next_run_at → 14:05 → run (blocks 15 min)
  14:15  finishes
  14:16  tick: next_run_at=14:05, elapsed 660s > grace 150s → "missed!"
         → fast-forward to 14:21 → continue (SKIP) → does NOT run
  ...    repeats forever for any job whose runtime > interval + grace.

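The arithmetic behind the 14:16 tick can be checked with a small sketch. The helper name `classify_tick`, the calendar date, and the rule that the fast-forwarded slot is `now + interval` are assumptions made for illustration (the last one is inferred from the 14:16 → 14:21 step in the timeline); this is not the scheduler's actual code.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(minutes=5)   # from the timeline above
GRACE = timedelta(seconds=150)    # from the timeline above


def classify_tick(now, next_run_at):
    """Hypothetical helper: return (missed, elapsed_seconds, new_next_run_at)."""
    elapsed = now - next_run_at
    missed = elapsed > GRACE
    # Assumption: a missed job is fast-forwarded to one interval after `now`.
    new_next = now + INTERVAL if missed else next_run_at
    return missed, int(elapsed.total_seconds()), new_next


# The date is arbitrary; only the times of day come from the timeline.
tick = datetime(2026, 1, 1, 14, 16)
stale_slot = datetime(2026, 1, 1, 14, 5)
missed, elapsed, new_next = classify_tick(tick, stale_slot)
print(missed, elapsed, new_next.strftime("%H:%M"))
```

With the timeline's numbers, the slot is 660 seconds stale, which exceeds the 150-second grace, so the tick is classified as missed and the next run moves to 14:21.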
The `continue` (skip execution) in `_get_due_jobs_locked` was designed to
prevent burst catch-up after *gateway downtime* (for example, running 6 missed
instances of a 30-min job on restart). But it also applied to a job that
missed its slot because it was *still running*, not because the gateway was
down.
Fix: keep the fast-forward (so accumulated missed slots are still collapsed
to a single next slot — no burst) but fall through to `due.append(job)` so
the job runs ONCE now. The log message is updated to be honest about the new
behavior ("Running now; next run fast-forwarded to: ...").
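A minimal sketch of the control-flow change described here, under stated assumptions: the `Job` dataclass, the `get_due_jobs` function, and the `run_missed_once` flag are hypothetical stand-ins, not the real `_get_due_jobs_locked` implementation, and the fast-forward target is assumed to be `now + interval`. The flag exists only to show the old and new paths side by side.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Job:
    # Hypothetical job record; the real job schema is not shown in the source.
    name: str
    interval: timedelta
    grace: timedelta
    next_run_at: datetime


def get_due_jobs(jobs, now, run_missed_once=True):
    """Return jobs to run at `now`, collapsing missed slots into one."""
    due = []
    for job in jobs:
        if job.next_run_at > now:
            continue  # not due yet
        if now - job.next_run_at > job.grace:
            # Missed: fast-forward so accumulated slots never burst.
            job.next_run_at = now + job.interval
            if not run_missed_once:
                continue  # old behavior: skip, which caused the loop
        else:
            # On time: advance to the next slot before running.
            job.next_run_at += job.interval
        due.append(job)  # new behavior: a missed job still runs once
    return due


def make_job():
    return Job("demo", timedelta(minutes=5), timedelta(seconds=150),
               datetime(2026, 1, 1, 14, 5))


now = datetime(2026, 1, 1, 14, 16)
old_job, new_job = make_job(), make_job()
old_due = get_due_jobs([old_job], now, run_missed_once=False)
new_due = get_due_jobs([new_job], now, run_missed_once=True)
print(len(old_due), len(new_due), new_job.next_run_at.strftime("%H:%M"))
```

In both modes the stale slot is fast-forwarded to the same time; the only difference is whether the job lands in the due list for the current tick.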
Behavior note: a recurring job missed during gateway downtime now also fires
once immediately on restart (rather than waiting for its next natural slot).
This is the intended trade-off — the same "run once, don't burst" rule now
applies uniformly to both downtime-misses and long-execution-misses.
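To illustrate the "run once, don't burst" rule for the downtime case, the sketch below counts how many slots a 30-min job would have accumulated and contrasts that with the single run that fires on restart. The function name `missed_slots`, the outage window, and the date are all hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta


def missed_slots(next_run_at, now, interval):
    """Count scheduled slots from `next_run_at` up to and including `now`."""
    if now < next_run_at:
        return 0
    # timedelta // timedelta yields an int number of whole intervals.
    return (now - next_run_at) // interval + 1


interval = timedelta(minutes=30)
first_missed = datetime(2026, 1, 1, 9, 0)   # hypothetical: first slot during the outage
restart = datetime(2026, 1, 1, 11, 45)      # hypothetical: gateway comes back

slots = missed_slots(first_missed, restart, interval)
runs_on_restart = 1 if slots else 0          # collapsed to a single run
print(slots, runs_on_restart)
```

Here six slots (09:00 through 11:30) were missed, yet only one execution fires on restart.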
Salvaged from #33318 by @liuhao1024 (authorship preserved). Also addresses
the diagnosis in #33361 (@agent-trivi), which proposed the same one-line fix.
Tests: updates `test_stale_past_due_skipped` →
`test_stale_past_due_runs_once_and_fast_forwards` (the old test encoded the
skip behavior); adds `test_long_execution_does_not_perpetually_defer` as a
direct regression for the production loop; updates the F2e timezone test that
relied on the old skip path. Full tests/cron/ suite: 510 passed.
Fixes #33315
Co-authored-by: liuhao1024 <sunsky.lau@gmail.com>
| File |
|---|
| __init__.py |
| conftest.py |
| test_blueprint_catalog.py |
| test_claim_job_for_fire.py |
| test_codex_execution_paths.py |
| test_compute_next_run_last_run_at.py |
| test_cron_context_from.py |
| test_cron_inactivity_timeout.py |
| test_cron_no_agent.py |
| test_cron_prompt_injection_skill.py |
| test_cron_script.py |
| test_cron_workdir.py |
| test_cronjob_schema.py |
| test_file_permissions.py |
| test_jobs.py |
| test_jobs_changed_notify.py |
| test_jobs_crossprocess_lock.py |
| test_parallel_pool.py |
| test_rewrite_skill_refs.py |
| test_run_one_job.py |
| test_scheduler.py |
| test_scheduler_mcp_init.py |
| test_scheduler_provider.py |
| test_suggestions.py |