hermes-agent/tests
Brian D. Evans 24461ec0bc fix(error-classifier): route 429 'overloaded' messages to FailoverReason.overloaded (#15297)
Some providers — notably Z.AI / Zhipu — return HTTP 429 with messages
like "The service may be temporarily overloaded, please try again later"
to signal **server-wide overload**, not per-credential rate limiting.
The two conditions need different recovery strategies:

| Reason | Correct strategy |
|---|---|
| ``rate_limit`` | Retry same credential once (the limit may have reset), then rotate |
| ``overloaded`` | Skip retry, rotate immediately (the whole endpoint is saturated) |

Before this fix:

- ``error_classifier.py`` mapped every 429 to ``FailoverReason.rate_limit``
  regardless of message body.
- ``FailoverReason.overloaded`` already existed as an enum value (and was
  produced for 503/529) but no production path emitted it for 429.
- ``_recover_with_credential_pool`` had no handler for ``overloaded`` —
  an ``overloaded`` classification fell through to the default no-op
  ``return False, has_retried_429`` line.

Net effect: every overload-language 429 burned a ``has_retried_429``
slot on the same saturated credential before the rotation happened, and
cron jobs (one turn each) often used their entire execution on that
wasted retry.

### Fix

Two narrow, additive changes:

1. ``agent/error_classifier.py`` — new ``_OVERLOADED_PATTERNS`` list
   containing provider-language overload phrases (``"temporarily
   overloaded"``, ``"server is overloaded"``, ``"at capacity"``, …).
   The 429 branch now checks the error body against this list and
   emits ``FailoverReason.overloaded`` when matched, preserving
   ``rate_limit`` for everything else.  Kept phrases narrow and
   provider-language-flavoured so a normal rate-limit message
   ("you have been rate-limited") doesn't hit this bucket.

2. ``run_agent.py::_recover_with_credential_pool`` — new branch for
   ``effective_reason == FailoverReason.overloaded``.  Same shape as
   the existing ``billing`` handler: rotate immediately via
   ``mark_exhausted_and_rotate(...)``, no retry-on-same-credential
   first.  The 503/529 ``overloaded`` classifications produced by the
   existing code now also flow through this branch.

### Tests (15 new, all passing on py3.11 venv)

``tests/agent/test_error_classifier.py`` — 7 cases:

- The exact #15297 Z.AI message → ``overloaded``
- 9-phrase parametrised matrix (server/service/upstream overloaded,
  at/over capacity, "currently overloaded", …) → all ``overloaded``
- Plain "Rate limit reached for requests per minute" 429 → still
  ``rate_limit`` (regression guard against the disambiguation
  silently broadening)
- Plain "Too Many Requests" 429 → still ``rate_limit``
- 503 path unchanged → still ``overloaded`` (negative control)

``tests/agent/test_credential_pool_routing.py`` —
``TestPoolOverloadedRotation`` (4 cases):

- Overloaded on first failure → rotates immediately (the fix)
- Overloaded with ``has_retried_429=True`` → still rotates (no retry-
  flag dependence — the rate-limit gate is specific to ``rate_limit``)
- Single-entry pool overload → ``recovered=False`` so outer fallback
  takes over rather than spinning
- ``rate_limit`` on first failure → still uses retry-first path
  (regression guard against the new branch broadening)

**Verified regression guards**: temporarily reverted the classifier's
overload branch; 10 of the 11 new classifier tests correctly failed
with clear assertion messages.  Restored fix → all 145 tests pass
(``test_error_classifier.py`` + ``test_credential_pool_routing.py``,
existing + new).

Closes #15297

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 16:39:48 -07:00
..
acp fix(acp): include MCP toolsets in ACP sessions 2026-04-24 03:04:42 -07:00
agent fix(error-classifier): route 429 'overloaded' messages to FailoverReason.overloaded (#15297) 2026-04-24 16:39:48 -07:00
cli feat: add slash command for busy input mode 2026-04-24 15:15:26 -07:00
cron feat(cron): per-job workdir for project-aware cron runs (#15110) 2026-04-24 05:07:01 -07:00
e2e refactor(commands): drop /provider, /plan handler, and clean up slash registry (#15047) 2026-04-24 03:10:52 -07:00
environments/benchmarks fix(security): consolidated security hardening — SSRF, timing attack, tar traversal, credential leakage (#5944) 2026-04-07 17:28:37 -07:00
fakes
gateway fix(gateway/bluebubbles): align iMessage delivery with non-editable UX 2026-04-24 16:04:37 -07:00
hermes_cli fix: mobile chat in new layout 2026-04-24 12:07:46 -04:00
hermes_state fix(resume): redirect --resume to the descendant that actually holds the messages 2026-04-24 03:04:42 -07:00
honcho_plugin feat(honcho): wizard cadence default 2, surface reasoning level, backwards-compat fallback 2026-04-18 22:50:55 -07:00
integration fix(discord): strip RTP padding before DAVE/Opus decode (#11267) 2026-04-16 16:50:15 -07:00
plugins feat(hindsight): optional bank_id_template for per-agent / per-user banks 2026-04-24 03:38:17 -07:00
run_agent fix(memory): skip external-provider sync on interrupted turns (#15218) 2026-04-24 15:30:18 -07:00
skills fix(google-workspace): normalize authorized user token writes 2026-04-16 04:22:16 -07:00
tools fix(env): safely quote ~/ subpaths in wrapped cd commands 2026-04-24 15:25:12 -07:00
tui_gateway fix(tui): keep default personality neutral 2026-04-24 16:19:23 -05:00
__init__.py
conftest.py test(conftest): reset module-level state + unset platform allowlists (#13400) 2026-04-21 01:33:10 -07:00
run_interrupt_test.py
test_account_usage.py feat(account-usage): add per-provider account limits module 2026-04-21 01:56:35 -07:00
test_base_url_hostname.py security(runtime_provider): close OLLAMA_API_KEY substring-leak sweep miss (#13522) 2026-04-21 06:06:16 -07:00
test_batch_runner_checkpoint.py test: regression coverage for checkpoint dedup and inf/nan coercion 2026-04-24 14:32:21 -07:00
test_cli_file_drop.py fix(tui): improve macOS paste and shortcut parity 2026-04-21 08:00:00 -07:00
test_cli_skin_integration.py fix: align status bar skin tests with upstream main 2026-04-22 13:20:02 -07:00
test_ctx_halving_fix.py fix(tests): fix 78 CI test failures and remove dead test (#9036) 2026-04-13 10:50:24 -07:00
test_empty_model_fallback.py fix: fall back to provider's default model when model config is empty (#8303) 2026-04-12 03:53:30 -07:00
test_evidence_store.py
test_hermes_constants.py fix(gateway): harden Docker/container gateway pathway 2026-04-12 16:36:11 -07:00
test_hermes_logging.py fix(tests): fix 78 CI test failures and remove dead test (#9036) 2026-04-13 10:50:24 -07:00
test_hermes_state.py feat(dashboard): track real API call count per session 2026-04-22 05:51:58 -07:00
test_honcho_client_config.py
test_ipv4_preference.py feat: add network.force_ipv4 config to fix IPv6 timeout issues (#8196) 2026-04-11 23:12:11 -07:00
test_mcp_serve.py
test_mini_swe_runner.py fix(kimi): omit temperature entirely for Kimi/Moonshot models (#13157) 2026-04-20 12:23:05 -07:00
test_minimax_model_validation.py fix(models): validate MiniMax models against static catalog (#12611, #12460, #12399, #12547) 2026-04-19 22:44:47 -07:00
test_minisweagent_path.py
test_model_picker_scroll.py fix: CLI/UX batch — ChatConsole errors, curses scroll, skin-aware banner, git state banner (#5974) 2026-04-07 17:59:42 -07:00
test_model_tools.py test: regression coverage for checkpoint dedup and inf/nan coercion 2026-04-24 14:32:21 -07:00
test_model_tools_async_bridge.py fix(core): ensure non-blocking executor shutdown on async timeout 2026-04-22 14:42:32 -07:00
test_ollama_num_ctx.py fix: provider/model resolution — salvage 4 PRs + MiniMax aux URL fix (#5983) 2026-04-07 22:23:28 -07:00
test_packaging_metadata.py
test_plugin_skills.py fix(tests): attach caplog to specific logger in 3 order-dependent tests (#11453) 2026-04-17 00:20:40 -07:00
test_project_metadata.py build(deps): add qrcode to dingtalk + feishu extras (parity with messaging) (#11627) 2026-04-17 13:31:53 -07:00
test_retry_utils.py feat(agent): add jittered retry backoff 2026-04-08 00:41:36 -07:00
test_sql_injection.py
test_subprocess_home_isolation.py fix: per-profile subprocess HOME isolation (#4426) (#7357) 2026-04-10 13:37:45 -07:00
test_timezone.py test: speed up slow tests (backoff + subprocess + IMDS network) (#11797) 2026-04-17 14:21:22 -07:00
test_toolset_distributions.py
test_toolsets.py fix(ci): unblock test suite + cut ~2s of dead Z.AI probes from every AIAgent 2026-04-19 19:18:19 -07:00
test_trajectory_compressor.py fix(kimi): omit temperature entirely for Kimi/Moonshot models (#13157) 2026-04-20 12:23:05 -07:00
test_trajectory_compressor_async.py fix(kimi): omit temperature entirely for Kimi/Moonshot models (#13157) 2026-04-20 12:23:05 -07:00
test_transform_tool_result_hook.py test: stop testing mutable data — convert change-detectors to invariants (#13363) 2026-04-20 23:20:33 -07:00
test_tui_gateway_server.py feat(tui): per-section visibility for the details accordion 2026-04-24 02:34:32 -05:00
test_utils_truthy_values.py