hermes-agent/tests/run_agent
Teknium a5bd56eae3
fix: eliminate provider hang dead zones in retry/timeout architecture (#8985)
Three targeted changes to close the gaps between retry layers that
caused users to experience 'No response from provider for 580s' and
'No activity for 15 minutes' despite having 5 layers of retry:

1. Remove non-streaming fallback from streaming path

   Previously, when all 3 stream retries exhausted, the code fell back
   to _interruptible_api_call() which had no stale detection and no
   activity tracking — a black hole that could hang for up to 1800s.
   Now errors propagate to the main retry loop which has richer recovery
   (credential rotation, provider fallback, backoff).

   For 'stream not supported' errors, sets _disable_streaming flag so
   the main retry loop automatically switches to non-streaming on the
   next attempt.

2. Add _touch_activity to recovery dead zones

   The gateway inactivity monitor relies on _touch_activity() to know
   the agent is alive, but activity was never touched during:
   - Stale stream detection/kill cycles (180-300s gaps)
   - Stream retry connection rebuilds
   - Main retry backoff sleeps (up to 120s)
   - Error recovery classification

   Now all these paths touch activity every ~30s, keeping the gateway
   informed during recovery cycles.

3. Add stale-call detector to non-streaming path

   _interruptible_api_call() now has the same stale detection pattern
   as the streaming path: kills hung connections after 300s (default,
   configurable via HERMES_API_CALL_STALE_TIMEOUT), scaled for large
   contexts (450s for 50K+ tokens, 600s for 100K+ tokens), disabled
   for local providers.

   Also touches activity every ~30s during the wait so the gateway
   monitor stays informed.

Env vars:
- HERMES_API_CALL_STALE_TIMEOUT: non-streaming stale timeout (default 300s)
- HERMES_STREAM_STALE_TIMEOUT: unchanged (default 180s)

Before: worst case ~2+ hours of sequential retries with no feedback
After: worst case bounded by gateway inactivity timeout (default 1800s)
with continuous activity reporting
2026-04-13 04:55:20 -07:00
..
__init__.py refactor(tests): re-architect tests + fix CI failures (#5946) 2026-04-07 17:19:07 -07:00
test_413_compression.py fix: clear conversation_history after mid-loop compression to prevent empty sessions (#7001) 2026-04-10 00:14:59 -07:00
test_860_dedup.py refactor(tests): re-architect tests + fix CI failures (#5946) 2026-04-07 17:19:07 -07:00
test_1630_context_overflow_loop.py refactor(tests): re-architect tests + fix CI failures (#5946) 2026-04-07 17:19:07 -07:00
test_agent_guardrails.py fix(delegate): make max_concurrent_children configurable + error on excess 2026-04-10 13:38:14 -07:00
test_agent_loop.py refactor(tests): re-architect tests + fix CI failures (#5946) 2026-04-07 17:19:07 -07:00
test_agent_loop_tool_calling.py refactor(tests): re-architect tests + fix CI failures (#5946) 2026-04-07 17:19:07 -07:00
test_agent_loop_vllm.py refactor(tests): re-architect tests + fix CI failures (#5946) 2026-04-07 17:19:07 -07:00
test_anthropic_error_handling.py refactor(tests): re-architect tests + fix CI failures (#5946) 2026-04-07 17:19:07 -07:00
test_async_httpx_del_neuter.py refactor(tests): re-architect tests + fix CI failures (#5946) 2026-04-07 17:19:07 -07:00
test_compression_boundary.py refactor(tests): re-architect tests + fix CI failures (#5946) 2026-04-07 17:19:07 -07:00
test_compression_feasibility.py test: add tests for compression config_context_length passthrough 2026-04-12 17:52:34 -07:00
test_compression_persistence.py refactor(tests): re-architect tests + fix CI failures (#5946) 2026-04-07 17:19:07 -07:00
test_compressor_fallback_update.py refactor(tests): re-architect tests + fix CI failures (#5946) 2026-04-07 17:19:07 -07:00
test_context_pressure.py fix(agent): tiered context pressure warnings + gateway dedup (#6411) 2026-04-08 21:31:44 -07:00
test_context_token_tracking.py refactor(tests): re-architect tests + fix CI failures (#5946) 2026-04-07 17:19:07 -07:00
test_dict_tool_call_args.py refactor(tests): re-architect tests + fix CI failures (#5946) 2026-04-07 17:19:07 -07:00
test_exit_cleanup_interrupt.py refactor(tests): re-architect tests + fix CI failures (#5946) 2026-04-07 17:19:07 -07:00
test_fallback_model.py fix(model): normalize direct provider ids in auxiliary routing 2026-04-10 05:52:45 -07:00
test_flush_memories_codex.py fix(agent): respect config timeout for flush_memories instead of hardcoded 30s 2026-04-08 18:55:33 -07:00
test_interactive_interrupt.py refactor(tests): re-architect tests + fix CI failures (#5946) 2026-04-07 17:19:07 -07:00
test_interrupt_propagation.py fix: scope tool interrupt signal per-thread to prevent cross-session leaks (#7930) 2026-04-11 14:02:58 -07:00
test_long_context_tier_429.py refactor(tests): re-architect tests + fix CI failures (#5946) 2026-04-07 17:19:07 -07:00
test_openai_client_lifecycle.py refactor(tests): re-architect tests + fix CI failures (#5946) 2026-04-07 17:19:07 -07:00
test_percentage_clamp.py fix: update 6 test files broken by dead code removal 2026-04-10 03:44:43 -07:00
test_primary_runtime_restore.py fix(run_agent): recover primary client on openai transport errors 2026-04-10 03:21:24 -07:00
test_provider_fallback.py refactor(tests): re-architect tests + fix CI failures (#5946) 2026-04-07 17:19:07 -07:00
test_provider_parity.py feat: expand /fast to all OpenAI Priority Processing models (#6960) 2026-04-09 22:06:30 -07:00
test_real_interrupt_subagent.py refactor(tests): re-architect tests + fix CI failures (#5946) 2026-04-07 17:19:07 -07:00
test_redirect_stdout_issue.py refactor(tests): re-architect tests + fix CI failures (#5946) 2026-04-07 17:19:07 -07:00
test_run_agent.py fix: eliminate provider hang dead zones in retry/timeout architecture (#8985) 2026-04-13 04:55:20 -07:00
test_run_agent_codex_responses.py feat(gateway): surface natural mid-turn assistant messages in chat platforms 2026-04-11 16:21:39 -07:00
test_session_meta_filtering.py refactor(tests): re-architect tests + fix CI failures (#5946) 2026-04-07 17:19:07 -07:00
test_session_reset_fix.py refactor(tests): re-architect tests + fix CI failures (#5946) 2026-04-07 17:19:07 -07:00
test_streaming.py fix: eliminate provider hang dead zones in retry/timeout architecture (#8985) 2026-04-13 04:55:20 -07:00
test_strict_api_validation.py refactor(tests): re-architect tests + fix CI failures (#5946) 2026-04-07 17:19:07 -07:00
test_switch_model_context.py fix: pass config_context_length to switch_model context compressor 2026-04-10 05:52:45 -07:00
test_token_persistence_non_cli.py refactor(tests): re-architect tests + fix CI failures (#5946) 2026-04-07 17:19:07 -07:00
test_tool_arg_coercion.py refactor(tests): re-architect tests + fix CI failures (#5946) 2026-04-07 17:19:07 -07:00
test_unicode_ascii_codec.py fix(unicode): sanitize surrogate metadata and allow two-pass retry 2026-04-10 13:05:01 -07:00