mirror of https://github.com/NousResearch/hermes-agent.git
synced 2026-05-08 03:01:47 +00:00
fix(kanban): heartbeat tool extends claim TTL, not just last_heartbeat_at
The kanban_heartbeat tool called heartbeat_worker but never heartbeat_claim, so a worker that loops the tool while a single tool call blocks the agent for >DEFAULT_CLAIM_TTL_SECONDS still got reclaimed by release_stale_claims.

The function name and heartbeat_claim's own docstring imply otherwise: "Workers that know they'll exceed 15 minutes should call this every few minutes to keep ownership." But there was no caller in the worker tool path. Workers couldn't invoke heartbeat_claim themselves either, because it isn't exposed as a tool.

Fix: _handle_heartbeat now calls heartbeat_claim first, reading HERMES_KANBAN_CLAIM_LOCK from the worker env (the dispatcher pins this in _default_spawn). Falls back to _claimer_id() for locally-driven workers that didn't go through dispatcher spawn.

Test: tests/tools/test_kanban_tools.py::test_heartbeat_extends_claim_expires rewinds claim_expires into the past, calls the tool, and asserts the new value is at least now + DEFAULT_CLAIM_TTL_SECONDS // 2. Verified to fail against the unfixed code (claim_expires stays at the rewound value).

Closes the root cause underlying the symptom in #21141 (15-min respawns of long-running workers). #21141 separately addresses post-reclaim cleanup; this fixes the upstream "shouldn't have been reclaimed in the first place" half.
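The failure mode and the regression check described above can be illustrated with a toy in-memory model. This is a minimal sketch under stated assumptions, not the hermes-agent implementation: the `Task` dataclass, the `handle_heartbeat` wrapper with its `extend_claim` switch, and the simplified signatures of `heartbeat_worker`, `heartbeat_claim`, and `release_stale_claims` are hypothetical (the real functions operate on a database connection), and the 15-minute TTL is taken from the docstring quoted in the commit message.

```python
import time
from dataclasses import dataclass
from typing import Optional

# Assumed value: the quoted docstring mentions 15 minutes.
DEFAULT_CLAIM_TTL_SECONDS = 15 * 60


@dataclass
class Task:
    """Hypothetical stand-in for a kanban task row."""
    claimer: Optional[str]
    claim_expires: float
    last_heartbeat_at: float = 0.0


def heartbeat_worker(task: Task, now: float) -> None:
    # Records liveness only; the claim expiry is left untouched.
    task.last_heartbeat_at = now


def heartbeat_claim(task: Task, claimer: str, now: float) -> bool:
    # Extends the claim, but only for the current owner.
    if task.claimer != claimer:
        return False
    task.claim_expires = now + DEFAULT_CLAIM_TTL_SECONDS
    return True


def release_stale_claims(task: Task, now: float) -> bool:
    # Reclaims a task whose claim has expired, regardless of heartbeats.
    if task.claimer is not None and task.claim_expires < now:
        task.claimer = None
        return True
    return False


def handle_heartbeat(task: Task, claimer: str, now: float, extend_claim: bool) -> None:
    # extend_claim=False models the unfixed tool; True models the fix.
    if extend_claim:
        heartbeat_claim(task, claimer, now)
    heartbeat_worker(task, now)


now = time.time()

# Unfixed behaviour: the heartbeat is recorded, yet the task is reclaimed.
stale = Task(claimer="worker-1", claim_expires=now - 1)
handle_heartbeat(stale, "worker-1", now, extend_claim=False)
reclaimed_without_fix = release_stale_claims(stale, now)

# Fixed behaviour: the same rewound claim is pushed into the future.
fixed = Task(claimer="worker-1", claim_expires=now - 1)
handle_heartbeat(fixed, "worker-1", now, extend_claim=True)
reclaimed_with_fix = release_stale_claims(fixed, now)
```

In this model, as in the described test, the claim expiry is rewound into the past before the heartbeat runs; only the variant that calls `heartbeat_claim` leaves `claim_expires` at or beyond `now + DEFAULT_CLAIM_TTL_SECONDS // 2`.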
parent bf843adf05
commit 40b51c93a2
2 changed files with 72 additions and 1 deletions
@@ -315,7 +315,15 @@ def _handle_block(args: dict, **kw) -> str:
 
 
 def _handle_heartbeat(args: dict, **kw) -> str:
-    """Signal that the worker is still alive during a long operation."""
+    """Signal that the worker is still alive during a long operation.
+
+    Extends the claim TTL via ``heartbeat_claim`` AND records a heartbeat
+    event via ``heartbeat_worker``. Without the ``heartbeat_claim`` half,
+    a diligent worker that loops this tool while a single tool call
+    blocks the agent for >DEFAULT_CLAIM_TTL_SECONDS still gets reclaimed
+    by ``release_stale_claims`` — which is exactly the trap that
+    ``heartbeat_claim``'s docstring warns against.
+    """
     tid = _default_task_id(args.get("task_id"))
     if not tid:
         return tool_error(
@@ -328,6 +336,14 @@ def _handle_heartbeat(args: dict, **kw) -> str:
     try:
         kb, conn = _connect()
         try:
+            # Extend the claim TTL first. The dispatcher pins
+            # HERMES_KANBAN_CLAIM_LOCK in the worker env at spawn time
+            # (see _default_spawn in kanban_db.py); falling back to the
+            # default _claimer_id() covers locally-driven workers that
+            # never went through the dispatcher path.
+            claim_lock = os.environ.get("HERMES_KANBAN_CLAIM_LOCK")
+            kb.heartbeat_claim(conn, tid, claimer=claim_lock)
+
             ok = kb.heartbeat_worker(
                 conn,
                 tid,
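The claim-lock lookup in this hunk passes whatever `os.environ.get` returns (possibly `None`) to `heartbeat_claim`, which, per the comment, falls back to `_claimer_id()`. The sketch below makes that fallback explicit in one place. It is an illustration only: `resolve_claim_lock` and `default_claimer_id` are hypothetical names, and the `hostname:pid` format is an assumption, since the real `_claimer_id()` is not shown in the diff.

```python
import os
import socket
from typing import Mapping, Optional


def default_claimer_id() -> str:
    # Hypothetical stand-in for _claimer_id(); the real format is not
    # shown in the diff, so hostname:pid is an assumption.
    return f"{socket.gethostname()}:{os.getpid()}"


def resolve_claim_lock(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the dispatcher-pinned lock if present, else a local default."""
    env = os.environ if env is None else env
    lock = env.get("HERMES_KANBAN_CLAIM_LOCK")
    # A missing or empty variable is treated as "not spawned by the dispatcher".
    return lock if lock else default_claimer_id()


# Usage: a dispatcher-spawned worker sees the pinned value; a locally
# driven worker gets the default identifier.
pinned = resolve_claim_lock({"HERMES_KANBAN_CLAIM_LOCK": "dispatcher-lock-123"})
local = resolve_claim_lock({})
```

Accepting the environment as a parameter keeps the helper testable without mutating `os.environ`; whether the real code resolves the fallback in the tool handler or inside `heartbeat_claim` is a design choice the diff leaves to the latter.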