fix(agent): fallback immediately on provider content-policy blocks (#33883)

* fix(agent): fallback immediately on provider content-policy blocks

Provider safety-filter refusals (e.g. OpenAI Codex 'flagged for possible
cybersecurity risk', OpenAI moderation 'violates our usage policies',
Anthropic safety-system rejections, Azure content_filter) are
deterministic decisions about a specific prompt. Retrying the same
prompt up to api_max_retries times just reproduces the same refusal and
burns paid attempts before surfacing the generic 'API failed after 3
retries — <provider message>' to Telegram / cron with no indication that
the failure came from the model provider rather than Hermes itself.

Classify these as a new FailoverReason.content_policy_blocked
(non-retryable, should_fallback=True) and route them through the
existing is_client_error path so the loop:
  - skips the 3x retry backoff
  - activates a configured fallback model immediately
  - emits a clear provider-safety message to the user (not the generic
    'Non-retryable error (HTTP None)') and surfaces actionable guidance
    when no fallback is configured (rephrase, narrow context, or set
    fallback_model in hermes config)
  - returns a final_response that explicitly tells the user this came
    from the model provider, so gateway delivery is unambiguous and
    cron last_status reflects the safety block rather than a vague
    'agent reported failure'
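
A minimal sketch of the control flow described above, not the actual code in
agent/conversation_loop.py. `Classified`, `classify`, and `run_with_fallback`
are hypothetical names, and the two-member enum is a stand-in for the real
FailoverReason:

```python
from dataclasses import dataclass
from enum import Enum


class FailoverReason(Enum):  # hypothetical two-member stand-in
    unknown = "unknown"
    content_policy_blocked = "content_policy_blocked"


@dataclass
class Classified:  # hypothetical stand-in for the classifier result
    reason: FailoverReason
    retryable: bool
    should_fallback: bool


def classify(exc: Exception) -> Classified:
    # Illustrative only: one verbatim refusal phrase, everything else retryable.
    if "flagged for possible cybersecurity risk" in str(exc).lower():
        return Classified(FailoverReason.content_policy_blocked, False, True)
    return Classified(FailoverReason.unknown, True, False)


def run_with_fallback(call, models, max_retries=3):
    """Return (response_text, attempts_made)."""
    attempts = 0
    last = None
    for model in models:
        for _ in range(max_retries):
            attempts += 1
            try:
                return call(model), attempts
            except Exception as exc:
                last = classify(exc)
                if not last.retryable:
                    break  # a policy block will not change on retry
        if last is None or not last.should_fallback:
            break  # no point trying the next model
    if last is not None and last.reason is FailoverReason.content_policy_blocked:
        return (
            "The model provider blocked this request under its content "
            "policy. Rephrase, narrow the context, or set fallback_model.",
            attempts,
        )
    return f"API failed after {attempts} attempts", attempts
```

In this sketch a policy block on the primary model costs one attempt and the
second configured model is tried next. With no fallback configured, the caller
gets the provider-attributed message after a single attempt.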

Patterns are intentionally narrow — verbatim refusal phrasings keyed to
specific provider safety pipelines, not generic words like 'policy' or
'violation' that would collide with billing / format / auth errors.
Regression guards in test_18028_content_policy_blocked.py verify
billing 402s, generic 400s, and OpenRouter account-level
provider_policy_blocked remain distinct classifications.
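
To illustrate what "narrow" means here, a hedged sketch follows. The phrases
are taken from the examples in this commit, but the regexes and the helper name
`looks_like_content_policy_block` are assumptions, not the classifier's actual
pattern list:

```python
import re

# Hypothetical pattern list: verbatim refusal phrasings and narrow tokens only.
_CONTENT_POLICY_PATTERNS = [
    re.compile(r"flagged for possible cybersecurity risk", re.I),
    re.compile(r"violat(?:es|ing) (?:our|openai's) usage policies", re.I),
    re.compile(r"flagged by our safety system", re.I),
    re.compile(r"\bResponsibleAIPolicyViolation\b"),
    re.compile(r"\bcontent_filter\b"),
]


def looks_like_content_policy_block(message: str) -> bool:
    """True only when a provider-specific refusal phrase is present."""
    return any(p.search(message) for p in _CONTENT_POLICY_PATTERNS)
```

Bare words such as 'policy' or 'violation' are deliberately absent, so a
billing or schema error that happens to mention them is not matched.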

Salvaged from #18164 onto current main, adapting to the file restructure
(loop logic moved from run_agent.py to agent/conversation_loop.py,
_emit_status → _buffer_status). Changes relative to the original:
  - broadened patterns beyond the OpenAI Codex cybersecurity case to
    cover OpenAI moderation, the Anthropic safety system, and Azure
    content_filter
  - added user-actionable guidance and a clear final_response so
    cron/gateway surfaces the policy block instead of a generic
    non-retryable error
  - added a regression-guard test module mirroring the is_client_error
    predicate

Addresses #18028.

Co-authored-by: Kuan-Chieh Huang <kchuang1015@users.noreply.github.com>

* chore: add kchuang1015 to AUTHOR_MAP

---------

Co-authored-by: Kuan-Chieh Huang <kchuang1015@users.noreply.github.com>
kshitij 2026-05-28 07:28:24 -07:00 committed by GitHub
parent a82c88bac0
commit 0554ef1aa3
5 changed files with 334 additions and 6 deletions

@@ -59,6 +59,7 @@ class TestFailoverReason:
            "invalid_encrypted_content",
            "multimodal_tool_content_unsupported",
            "provider_policy_blocked",
            "content_policy_blocked",
            "thinking_signature", "long_context_tier",
            "oauth_long_context_beta_forbidden",
            "llama_cpp_grammar_pattern",
@@ -466,6 +467,78 @@ class TestClassifyApiError:
        result = classify_api_error(e)
        assert result.reason == FailoverReason.provider_policy_blocked

    # ── Provider content-policy block (per-prompt safety filter) ──
    #
    # Distinct from ``provider_policy_blocked`` above — these are upstream
    # model-provider safety refusals for THIS prompt, not OpenRouter
    # account-level data policy. Recovery is fallback model, not config fix.
    # See issue #18028 — OpenAI Codex was burning 3 retries on identical
    # refusals before users saw "API failed after 3 retries" on Telegram.

    def test_message_only_cyber_content_policy_blocked(self):
        # OpenAI Codex returns this without an HTTP status. Retrying the
        # same prompt three times only repeats the same policy decision, so
        # the classifier must jump straight to fallback / abort instead of
        # leaving it in the retryable ``unknown`` bucket.
        e = Exception(
            "This content was flagged for possible cybersecurity risk. If this "
            "seems wrong, try rephrasing your request. To get authorized for "
            "security work, join the Trusted Access for Cyber program."
        )
        result = classify_api_error(e, provider="openai-codex", model="gpt-5.5")
        assert result.reason == FailoverReason.content_policy_blocked
        assert result.retryable is False
        assert result.should_fallback is True
        assert result.should_compress is False

    def test_400_cyber_content_policy_blocked(self):
        # When the SDK does attach a status (e.g. 400), the safety pattern
        # must still beat the format_error fallthrough.
        e = MockAPIError(
            "This content was flagged for possible cybersecurity risk",
            status_code=400,
        )
        result = classify_api_error(e, provider="openai-codex", model="gpt-5.5")
        assert result.reason == FailoverReason.content_policy_blocked
        assert result.retryable is False
        assert result.should_fallback is True

    def test_openai_usage_policy_violation_content_policy_blocked(self):
        # OpenAI moderation refusal wording from chat completions / responses.
        e = MockAPIError(
            "Your request was flagged by the moderation system as potentially "
            "violating OpenAI's usage policies.",
            status_code=400,
        )
        result = classify_api_error(e, provider="openai", model="gpt-4o")
        assert result.reason == FailoverReason.content_policy_blocked
        assert result.retryable is False
        assert result.should_fallback is True

    def test_anthropic_safety_system_content_policy_blocked(self):
        # Anthropic safety refusal — distinct phrasing from OpenAI.
        e = Exception(
            "Your prompt was flagged by our safety system. Please rephrase "
            "and try again."
        )
        result = classify_api_error(e, provider="anthropic", model="claude-3-5-sonnet")
        assert result.reason == FailoverReason.content_policy_blocked
        assert result.retryable is False
        assert result.should_fallback is True

    def test_azure_content_filter_content_policy_blocked(self):
        # Azure OpenAI returns ``content_filter`` finish reason / error code
        # and ``ResponsibleAIPolicyViolation`` in error bodies — both narrow
        # tokens, not the generic English phrase.
        e = MockAPIError(
            "The response was filtered: ResponsibleAIPolicyViolation "
            "(finish_reason=content_filter).",
            status_code=400,
        )
        result = classify_api_error(e, provider="azure", model="gpt-4o")
        assert result.reason == FailoverReason.content_policy_blocked
        assert result.retryable is False

    def test_404_model_not_found_still_works(self):
        # Regression guard: the new policy-block check must not swallow
        # genuine model_not_found 404s.