Mirror of https://github.com/NousResearch/hermes-agent.git, synced 2026-05-08 03:01:47 +00:00
When a kanban worker subprocess exits with rc=0 but its task is still in status='running', the agent almost certainly answered the task conversationally without calling kanban_complete or kanban_block. The dispatcher used to classify this as a generic crash and respawn the worker, which loops forever on small local models (gemma4-e2b q4, etc.) that keep returning clean but unproductive output.

Dispatcher changes:
- The waitpid reap loop at the top of dispatch_once now records each reaped child's raw exit status in a bounded module-level registry (_recent_worker_exits, TTL 600s, size cap 4096).
- _classify_worker_exit distinguishes clean_exit / nonzero_exit / signaled / unknown using os.WIFEXITED / os.WIFSIGNALED.
- detect_crashed_workers consults the classification when it finds a worker dead. clean_exit → protocol_violation event + immediate circuit-breaker trip (failure_limit=1). Every other classification keeps the existing crashed event + counter behavior.
- DispatchResult.auto_blocked now includes protocol-violation trips.

Gateway fix (Bug A in #20894):
- gateway.run._notify_active_sessions_of_shutdown snapshots self.adapters with list(...) before iterating. adapter.send() can hit a fatal-error path that pops the adapter from the dict, which was raising 'RuntimeError: dictionary changed size during iteration' during shutdown.

Regression tests:
- test_detect_crashed_workers_protocol_violation_auto_blocks verifies that rc=0 + still running → status=blocked on the first occurrence, with protocol_violation and gave_up events and NO crashed event.
- test_detect_crashed_workers_nonzero_exit_uses_default_limit verifies that non-zero exits keep the existing 2-strike behavior.

Closes #20894.
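The dispatcher changes described above can be sketched in plain Python. This is a hypothetical, POSIX-only illustration, not the code from this commit: RecentExits, reap_children, classify_worker_exit, Task and handle_dead_worker are invented names that only mirror _recent_worker_exits, _classify_worker_exit and detect_crashed_workers, and the task model (a status string, a failure counter and an event list) is an assumption. Only the TTL (600 s), the size cap (4096), failure_limit=1 and the 2-strike default come from the commit message.

```python
import os
import time
from collections import OrderedDict
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Numbers taken from the commit message; every name below is hypothetical.
EXIT_TTL_SECONDS = 600.0
EXIT_CACHE_MAX = 4096
DEFAULT_FAILURE_LIMIT = 2  # the existing "2-strike" crash behavior


class RecentExits:
    """Bounded pid -> (raw wait status, reaped_at) registry with a TTL."""

    def __init__(self, ttl: float = EXIT_TTL_SECONDS, max_size: int = EXIT_CACHE_MAX):
        self.ttl = ttl
        self.max_size = max_size
        self._entries: "OrderedDict[int, Tuple[int, float]]" = OrderedDict()

    def record(self, pid: int, status: int, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._entries.pop(pid, None)  # re-insert so the newest entry is last
        self._entries[pid] = (status, now)
        self._prune(now)

    def get(self, pid: int, now: Optional[float] = None) -> Optional[int]:
        now = time.monotonic() if now is None else now
        self._prune(now)
        entry = self._entries.get(pid)
        return None if entry is None else entry[0]

    def _prune(self, now: float) -> None:
        # Evict from the oldest end while the head is expired or the cap is exceeded.
        while self._entries:
            _, (_, reaped_at) = next(iter(self._entries.items()))
            if now - reaped_at <= self.ttl and len(self._entries) <= self.max_size:
                break
            self._entries.popitem(last=False)


def reap_children(registry: RecentExits) -> None:
    """Non-blocking waitpid loop that remembers every reaped child's raw status."""
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no child processes at all
        if pid == 0:
            return  # children exist, but none has exited yet
        registry.record(pid, status)


def classify_worker_exit(status: Optional[int]) -> str:
    if status is None:
        return "unknown"  # exit was never observed, or its entry expired
    if os.WIFEXITED(status):
        return "clean_exit" if os.WEXITSTATUS(status) == 0 else "nonzero_exit"
    if os.WIFSIGNALED(status):
        return "signaled"
    return "unknown"


@dataclass
class Task:
    status: str = "running"
    failures: int = 0
    events: List[str] = field(default_factory=list)


def handle_dead_worker(task: Task, exit_status: Optional[int]) -> None:
    """Called for a task still marked 'running' whose worker process is gone."""
    if classify_worker_exit(exit_status) == "clean_exit":
        # rc=0 yet still running: the agent never called kanban_complete or
        # kanban_block, so a respawn would just loop. Trip on first occurrence.
        task.events.append("protocol_violation")
        limit = 1
    else:
        task.events.append("crashed")
        limit = DEFAULT_FAILURE_LIMIT
    task.failures += 1
    if task.failures >= limit:
        task.status = "blocked"
        task.events.append("gave_up")
    # Otherwise the dispatcher would respawn the worker (not modelled here).
```

Keying the decision on the raw wait status, rather than on a generic "process is gone" check, is what lets a clean rc=0 exit be handled differently from a non-zero exit or a kill signal.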
Parent: 699c770e5c, commit: fdb9e0f6a6
3 changed files with 255 additions and 14 deletions
```diff
@@ -2521,7 +2521,12 @@ class GatewayRunner:
                     platform_str, chat_id, e,
                 )
 
-        for platform, adapter in self.adapters.items():
+        # Snapshot adapters up front: adapter.send() can hit a fatal error
+        # path that pops the adapter from self.adapters (see _handle_fatal
+        # elsewhere), which would otherwise trigger
+        # ``RuntimeError: dictionary changed size during iteration`` —
+        # observed in a user report during gateway shutdown.
+        for platform, adapter in list(self.adapters.items()):
             home = self.config.get_home_channel(platform)
             if not home or not home.chat_id:
                 continue
```
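The failure mode this hunk guards against can be reproduced in isolation. In this minimal sketch, FlakyAdapter, make_adapters, notify_live_view and notify_snapshot are hypothetical stand-ins, not the gateway's real classes, and the platform keys are placeholders; only the list(...) snapshot idea comes from the commit.

```python
from typing import Dict, List


class FlakyAdapter:
    """Hypothetical adapter whose fatal-error path removes it from the registry."""

    def __init__(self, name: str, registry: Dict[str, "FlakyAdapter"], fatal: bool) -> None:
        self.name = name
        self.registry = registry
        self.fatal = fatal
        self.sent: List[str] = []

    def send(self, message: str) -> None:
        if self.fatal:
            # Stand-in for a fatal-error handler that pops the adapter mid-loop.
            self.registry.pop(self.name, None)
            return
        self.sent.append(message)


def make_adapters() -> Dict[str, FlakyAdapter]:
    adapters: Dict[str, FlakyAdapter] = {}
    adapters["platform_a"] = FlakyAdapter("platform_a", adapters, fatal=True)
    adapters["platform_b"] = FlakyAdapter("platform_b", adapters, fatal=False)
    return adapters


def notify_live_view(adapters: Dict[str, FlakyAdapter]) -> None:
    # Iterates the live dict: raises RuntimeError once send() removes an entry.
    for _, adapter in adapters.items():
        adapter.send("gateway shutting down")


def notify_snapshot(adapters: Dict[str, FlakyAdapter]) -> None:
    # Iterates a list copy, so send() may safely mutate `adapters`.
    for _, adapter in list(adapters.items()):
        adapter.send("gateway shutting down")
```

The snapshot only protects the iteration itself: if one adapter's send() removed a different adapter, that adapter would still be visited, because it is already in the copied list.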