feat(prompt): universal task-completion guidance + local Python toolchain probe (#34340)

* fix(codex): surface error code in Responses 'failed' status errors

When a Codex Responses turn ends with status=failed, the response carries
the failure details under `response.error` as
`{code, message, param, ...}`. The previous extractor pulled only
`message`, so users seeing a rate-limit failure got a bare "Slow down"
string indistinguishable from a generic stream truncation; an
internal_error with empty message degraded to a dict dump
("{'code': 'internal_error', 'message': ''}").

Extract a `_format_responses_error()` helper that:
- prefixes `code` when both code and message are present
  (e.g. 'rate_limit_exceeded: Slow down')
- falls back to the bare `code` when message is empty
- accepts both dict and attribute-style payloads (SDK and JSON-RPC paths)
- preserves the prior status-only fallback when no error payload exists

Apply the same helper at the sibling site in
`codex_app_server_session.run_turn()` so codex-CLI subprocess turn
failures get the same treatment.

Tests:
- 8 new unit tests for `_format_responses_error` covering both shapes,
  empty/missing fields, non-string fields, and the status-only fallback.
- 2 regression tests on `_normalize_codex_response` for failed status
  with and without a code, asserting the exact RuntimeError message.
- All 3603 tests in tests/agent/ pass.

Adapted from anomalyco/opencode#28757.

* feat(prompt): universal task-completion guidance + local Python toolchain probe

Two cross-model failure modes get a single-line answer in the cached
system prompt. Both gated by config (default on), both add zero overhead
when not needed, both verified via real AIAgent prompt builds.

## What changed

`TASK_COMPLETION_GUIDANCE` — short prompt block applied to ALL models.
Targets two failure modes observed on a real Sarasota real-estate build
task: (1) Opus stopped after writing an 85-byte stub and gave a prose
response with finish_reason=stop on call #3 of 90; (2) DeepSeek pushed
through a PEP-668 wall, then returned fabricated listings instead of
admitting the blocker. Both behaviors are model-family-agnostic, so the
guidance lives outside the existing tool_use_enforcement gate (~192
tokens, paid once per session via prefix cache).

`tools/env_probe.py` — local Python toolchain probe. Detects
python3/pip/uv/PEP-668 state and emits ONE short line in the system
prompt when something is non-default. Emits NOTHING when the env is
clean (zero token cost for normal users). Skipped entirely for remote
terminal backends (docker/modal/ssh) — they have their own probe.

Example output on a broken environment (the actual case):

    Python toolchain: python3=3.11.15 (no pip module),
    python=missing (use python3), pip→python3.12 (mismatch),
    PEP 668=yes (use venv or uv).

## Config

Both flags live under `agent.` in config.yaml, default True:

    agent:
      task_completion_guidance: true   # universal "finish the job" block
      environment_probe: true          # local Python toolchain hints

Neither addition required a `_config_version` bump — deep-merge fills
defaults in for existing user configs.

## Validation

| Test surface | Result |
|---|---|
| tests/tools/test_env_probe.py | 10/10 pass (probe unit) |
| tests/run_agent/test_run_agent.py — new classes | 8/8 pass (integration) |
| TestToolUseEnforcementConfig | 17/17 pass (no regression) |
| TestBuildSystemPrompt | 9/9 pass (no regression) |
| TestInvalidateSystemPrompt | 2/2 pass (no regression) |
| tests/agent/test_prompt_builder.py | 124/124 pass (no regression) |
| tests/hermes_cli/ | 5662/5662 pass (config defaults) |
| E2E AIAgent build (broken env) | Both blocks present, 2,178 chars |
| E2E AIAgent build (clean env) | 771-char net overhead, env probe silent |
This commit is contained in:
Teknium 2026-05-28 22:26:09 -07:00 committed by GitHub
parent 75d2c081c9
commit a4d8f0f62a
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
10 changed files with 819 additions and 6 deletions

View file

@ -1201,6 +1201,18 @@ def init_agent(
_agent_section = {}
agent._tool_use_enforcement = _agent_section.get("tool_use_enforcement", "auto")
# Universal task-completion guidance toggle. Default True. Surfaced
# as a separate flag from tool_use_enforcement because the guidance
# applies to ALL models, not just the model families enforcement
# targets.
agent._task_completion_guidance = bool(_agent_section.get("task_completion_guidance", True))
# Local Python toolchain probe toggle. Default True. When False,
# the probe is skipped entirely (no subprocess calls, no system-prompt
# line). Useful for users on exotic setups where the probe heuristics
# are noisy.
agent._environment_probe = bool(_agent_section.get("environment_probe", True))
# App-level API retry count (wraps each model API call). Default 3,
# overridable via agent.api_max_retries in config.yaml. See #11616.
try:

View file

@ -980,6 +980,48 @@ def _extract_responses_reasoning_text(item: Any) -> str:
return ""
def _format_responses_error(error_obj: Any, response_status: str) -> str:
"""Build a human-readable error string from a Responses ``response.error`` payload.
The OpenAI Responses API carries failure details under ``response.error``
on terminal ``response.failed`` events, in the shape
``{"code": "rate_limit_exceeded", "message": "Slow down", "param": ...}``.
Earlier code only surfaced ``message``, which left users staring at bare
strings like ``"Slow down"`` while the failure mode (rate limit vs
context-length vs internal_error vs model-overloaded) was hidden in
``code``. We now prefix ``code`` when both are present so consumers can
distinguish failure modes without parsing the bare message.
Falls back to ``code`` alone when ``message`` is empty, and to a stable
default referencing the response status when no error payload is
available at all. Adapted from anomalyco/opencode#28757.
"""
# Pull code and message from either dict or attribute-style payloads.
code: Any = None
message: Any = None
if isinstance(error_obj, dict):
code = error_obj.get("code")
message = error_obj.get("message")
elif error_obj is not None:
code = getattr(error_obj, "code", None)
message = getattr(error_obj, "message", None)
code_str = str(code).strip() if isinstance(code, str) else (str(code).strip() if code else "")
message_str = str(message).strip() if isinstance(message, str) else (str(message).strip() if message else "")
if code_str and message_str:
return f"{code_str}: {message_str}"
if message_str:
return message_str
if code_str:
return code_str
if error_obj:
# Last-resort: stringify whatever the provider sent so it's at least
# visible in logs/UI rather than silently swallowed.
return str(error_obj)
return f"Responses API returned status '{response_status}'"
# ---------------------------------------------------------------------------
# Full response normalization
# ---------------------------------------------------------------------------
@ -1023,10 +1065,7 @@ def _normalize_codex_response(
if response_status in {"failed", "cancelled"}:
error_obj = getattr(response, "error", None)
if isinstance(error_obj, dict):
error_msg = error_obj.get("message") or str(error_obj)
else:
error_msg = str(error_obj) if error_obj else f"Responses API returned status '{response_status}'"
error_msg = _format_responses_error(error_obj, response_status)
raise RuntimeError(error_msg)
content_parts: List[str] = []

View file

@ -262,6 +262,37 @@ TOOL_USE_ENFORCEMENT_GUIDANCE = (
# Add new patterns here when a model family needs explicit steering.
TOOL_USE_ENFORCEMENT_MODELS = ("gpt", "codex", "gemini", "gemma", "grok", "glm", "qwen", "deepseek")
# Universal "finish the job" guidance — applied to ALL models, not gated
# by model family. Addresses two cross-model failure modes:
# 1. Stopping after a stub: writing a tiny file or running one command
# and then ending the turn with a description of the plan instead
# of the finished artifact. (Observed on Opus during a real
# Sarasota real-estate build task: 3 API calls, 85-byte file,
# one terminal command, finish_reason=stop.)
# 2. Fabricating output when a real path is blocked. When `pip` or a
# tool fails, some models will synthesize plausible-looking results
# (fake addresses, fake JSON, fake numbers) instead of reporting
# the blocker. (Observed on DeepSeek v4-flash on the same task:
# pushed through PEP-668 wall, then returned fabricated listings.)
#
# Short on purpose. This block is shipped to every user, every session,
# in the cached system prompt — token cost is paid once at install and
# then amortised across all sessions via prefix caching. Keep it tight.
TASK_COMPLETION_GUIDANCE = (
"# Finishing the job\n"
"When the user asks you to build, run, or verify something, the deliverable is "
"a working artifact backed by real tool output — not a description of one. "
"Do not stop after writing a stub, a plan, or a single command. Keep working "
"until you have actually exercised the code or produced the requested result, "
"then report what real execution returned.\n"
"If a tool, install, or network call fails and blocks the real path, say so "
"directly and try an alternative (different package manager, different "
"approach, ask the user). NEVER substitute plausible-looking fabricated "
"output (made-up data, invented file contents, synthesised API responses) "
"for results you couldn't actually produce. Reporting a blocker honestly "
"is always better than inventing a result."
)
# OpenAI GPT/Codex-specific execution guidance. Addresses known failure modes
# where GPT models abandon work on partial results, skip prerequisite lookups,
# hallucinate instead of using tools, and declare "done" without verification.

View file

@ -37,6 +37,7 @@ from agent.prompt_builder import (
PLATFORM_HINTS,
SESSION_SEARCH_GUIDANCE,
SKILLS_GUIDANCE,
TASK_COMPLETION_GUIDANCE,
TOOL_USE_ENFORCEMENT_GUIDANCE,
TOOL_USE_ENFORCEMENT_MODELS,
)
@ -100,6 +101,15 @@ def build_system_prompt_parts(agent: Any, system_message: Optional[str] = None)
# Pointer to the hermes-agent skill + docs for user questions about Hermes itself.
stable_parts.append(HERMES_AGENT_HELP_GUIDANCE)
# Universal task-completion / no-fabrication guidance. Applied to ALL
# models regardless of tool_use_enforcement gating — the failure modes
# this targets (stopping after a stub; fabricating output when a real
# path is blocked) are not model-family specific. Gated only by
# config.yaml ``agent.task_completion_guidance`` (default True) so
# users who want a leaner prompt can turn it off.
if getattr(agent, "_task_completion_guidance", True) and agent.valid_tool_names:
stable_parts.append(TASK_COMPLETION_GUIDANCE)
# Tool-aware behavioral guidance: only inject when the tools are loaded
tool_guidance = []
if "memory" in agent.valid_tool_names:
@ -205,6 +215,23 @@ def build_system_prompt_parts(agent: Any, system_message: Optional[str] = None)
if _env_hints:
stable_parts.append(_env_hints)
# Local Python toolchain probe — names python/pip/uv/PEP-668 state when
# something is non-default so the model can pick the right install
# strategy without discovering by failure. Emits a single line; emits
# NOTHING when the environment is clean (no token cost). Skipped
# entirely for remote terminal backends (the host's Python state is
# irrelevant when tools run inside docker/modal/ssh). Gated by
# config.yaml ``agent.environment_probe`` (default True).
if getattr(agent, "_environment_probe", True):
try:
from tools.env_probe import get_environment_probe_line
_probe_line = get_environment_probe_line()
if _probe_line:
stable_parts.append(_probe_line)
except Exception:
# Probe failure must never block prompt build.
pass
# Active-profile hint — names the Hermes profile the agent is running
# under so it doesn't conflate ~/.hermes/skills/ (default profile) with
# ~/.hermes/profiles/<active>/skills/ (this profile's). Deterministic

View file

@ -31,6 +31,7 @@ import time
from dataclasses import dataclass, field
from typing import Any, Callable, Optional
from agent.codex_responses_adapter import _format_responses_error
from agent.redact import redact_sensitive_text
from agent.transports.codex_app_server import (
CodexAppServerClient,
@ -581,7 +582,7 @@ class CodexAppServerSession:
(note.get("params") or {}).get("turn") or {}
).get("error")
if err_obj:
err_msg = err_obj.get("message") or str(err_obj)
err_msg = _format_responses_error(err_obj, str(turn_status))
# If the turn failed for an auth/refresh reason,
# rewrite the error into a re-auth hint AND mark
# the session for retirement.