fix(delegation): increase heartbeat stale thresholds

The heartbeat stale detection was too aggressive:
- idle: 5 * 30s = 150s — LLM inference on slow providers (Zhipu/GLM)
  frequently exceeds 150s, causing heartbeat to stop prematurely
- in-tool: 20 * 30s = 600s — borderline for long tool calls

When heartbeat stops, parent._last_activity_ts freezes, eventually
triggering gateway timeout and killing the entire delegation.

New thresholds:
- idle: 15 * 30s = 450s — accommodates slow LLM inference
- in-tool: 40 * 30s = 1200s — accommodates long-running tool calls

child_timeout_seconds (config: delegation.child_timeout_seconds) remains
the hard cap for total delegation duration.
This commit is contained in:
阿泥豆 2026-04-26 14:37:12 +08:00 committed by Teknium
parent 6b4ccb9b14
commit 0cc63043e0

View file

@ -483,8 +483,8 @@ _HEARTBEAT_INTERVAL = 30 # seconds between parent activity heartbeats during de
# The idle ceiling stays tight so genuinely stuck children don't mask the gateway
# timeout. The in-tool ceiling is much higher so legit long-running tools get
# time to finish; child_timeout_seconds (default 600s) is still the hard cap.
_HEARTBEAT_STALE_CYCLES_IDLE = 5 # 5 * 30s = 150s idle between turns → stale
_HEARTBEAT_STALE_CYCLES_IN_TOOL = 20 # 20 * 30s = 600s stuck on same tool → stale
_HEARTBEAT_STALE_CYCLES_IDLE = 15 # 15 * 30s = 450s idle between turns → stale
_HEARTBEAT_STALE_CYCLES_IN_TOOL = 40 # 40 * 30s = 1200s stuck on same tool → stale
DEFAULT_TOOLSETS = ["terminal", "file", "web"]