hermes-agent/tools/env_probe.py
Teknium a4d8f0f62a
feat(prompt): universal task-completion guidance + local Python toolchain probe (#34340)
* fix(codex): surface error code in Responses 'failed' status errors

When a Codex Responses turn ends with status=failed, the response carries
the failure details under `response.error` as
`{code, message, param, ...}`. The previous extractor pulled only
`message`, so users seeing a rate-limit failure got a bare "Slow down"
string indistinguishable from a generic stream truncation; an
internal_error with empty message degraded to a dict dump
("{'code': 'internal_error', 'message': ''}").

Extract a `_format_responses_error()` helper that:
- prefixes `code` when both code and message are present
  (e.g. 'rate_limit_exceeded: Slow down')
- falls back to the bare `code` when message is empty
- accepts both dict and attribute-style payloads (SDK and JSON-RPC paths)
- preserves the prior status-only fallback when no error payload exists

Apply the same helper at the sibling site in
`codex_app_server_session.run_turn()` so codex-CLI subprocess turn
failures get the same treatment.

Tests:
- 8 new unit tests for `_format_responses_error` covering both shapes,
  empty/missing fields, non-string fields, and the status-only fallback.
- 2 regression tests on `_normalize_codex_response` for failed status
  with and without a code, asserting the exact RuntimeError message.
- All 3603 tests in tests/agent/ pass.

Adapted from anomalyco/opencode#28757.

* feat(prompt): universal task-completion guidance + local Python toolchain probe

Two cross-model failure modes get a single-line answer in the cached
system prompt. Both gated by config (default on), both add zero overhead
when not needed, both verified via real AIAgent prompt builds.

## What changed

`TASK_COMPLETION_GUIDANCE` — short prompt block applied to ALL models.
Targets two failure modes observed on a real Sarasota real-estate build
task: (1) Opus stopped after writing an 85-byte stub and gave a prose
response with finish_reason=stop on call #3 of 90; (2) DeepSeek pushed
through a PEP-668 wall, then returned fabricated listings instead of
admitting the blocker. Both behaviors are model-family-agnostic, so the
guidance lives outside the existing tool_use_enforcement gate (~192
tokens, paid once per session via prefix cache).

`tools/env_probe.py` — local Python toolchain probe. Detects
python3/pip/uv/PEP-668 state and emits ONE short line in the system
prompt when something is non-default. Emits NOTHING when the env is
clean (zero token cost for normal users). Skipped entirely for remote
terminal backends (docker/modal/ssh) — they have their own probe.

Example output on a broken environment (the actual case):

    Python toolchain: python3=3.11.15 (no pip module),
    python=missing (use python3), pip→python3.12 (mismatch),
    PEP 668=yes (use venv or uv).

## Config

Both flags live under `agent.` in config.yaml, default True:

    agent:
      task_completion_guidance: true   # universal "finish the job" block
      environment_probe: true          # local Python toolchain hints

Neither addition required a `_config_version` bump — deep-merge fills
defaults in for existing user configs.

## Validation

| Test surface | Result |
|---|---|
| tests/tools/test_env_probe.py | 10/10 pass (probe unit) |
| tests/run_agent/test_run_agent.py — new classes | 8/8 pass (integration) |
| TestToolUseEnforcementConfig | 17/17 pass (no regression) |
| TestBuildSystemPrompt | 9/9 pass (no regression) |
| TestInvalidateSystemPrompt | 2/2 pass (no regression) |
| tests/agent/test_prompt_builder.py | 124/124 pass (no regression) |
| tests/hermes_cli/ | 5662/5662 pass (config defaults) |
| E2E AIAgent build (broken env) | Both blocks present, 2,178 chars |
| E2E AIAgent build (clean env) | 771-char net overhead, env probe silent |
2026-05-28 22:26:09 -07:00

247 lines
8.4 KiB
Python

"""Local-environment toolchain probe for the system prompt.
When the terminal backend is local (the agent's tools run on the same
machine as Hermes itself), we surface a single deterministic line about
Python tooling state so models don't have to discover it by hitting
walls. Common failure modes this addresses:
* Hermes ships under one Python (e.g. 3.11 in a bundled venv) while the
user's login shell has a different one (e.g. 3.12 system). ``pip``
resolved from PATH may not match ``python3 -m pip``.
* The bundled-venv Python has no pip module installed → ``python3 -m
pip`` returns ``No module named pip``.
* The system Python is PEP-668 externally-managed → naive
``pip install`` fails with ``error: externally-managed-environment``.
The probe is cheap (a handful of subprocess calls, ~50ms total),
cached for the lifetime of the process, and emits **at most one
short line** when something non-default is detected. When the
environment looks normal (python3+pip both present and matched, no
PEP 668), it emits nothing — no token cost.
Remote terminal backends (docker, modal, ssh, …) are skipped: the
host's Python state is irrelevant when tools run inside a sandbox.
The sandbox has its own existing probe (``_probe_remote_backend``)
in ``agent/prompt_builder.py``.
Toggle via ``agent.environment_probe`` in config.yaml (default True).
"""
from __future__ import annotations
import logging
import os
import shutil
import subprocess
import sys
import threading
from typing import Optional
logger = logging.getLogger(__name__)
# Module-level cache. The probe result is deterministic for the
# lifetime of the process — Python install state doesn't change
# mid-session in any way that would matter for the system prompt.
_CACHE_LOCK = threading.Lock()
_CACHED_LINE: Optional[str] = None # None = not probed yet; "" = probed, nothing to say.
# Remote backends — keep in sync with agent/prompt_builder.py:_REMOTE_TERMINAL_BACKENDS.
# Duplicated rather than imported to avoid a circular import (prompt_builder
# imports nothing from tools).
_REMOTE_BACKENDS = frozenset({
"docker", "singularity", "modal", "daytona", "ssh", "managed_modal",
})
def _run(cmd: list[str], timeout: float = 3.0) -> tuple[int, str, str]:
"""Run a short subprocess. Returns (returncode, stdout, stderr).
Failures (binary missing, timeout, OSError) return (-1, "", "<reason>").
"""
try:
result = subprocess.run(
cmd,
capture_output=True,
text=True,
timeout=timeout,
check=False,
)
return result.returncode, (result.stdout or "").strip(), (result.stderr or "").strip()
except FileNotFoundError:
return -1, "", "not found"
except subprocess.TimeoutExpired:
return -1, "", "timeout"
except OSError as exc:
return -1, "", f"oserror: {exc}"
def _python_version_of(binary: str) -> Optional[str]:
"""Return a short version string like ``3.12.4`` for ``binary``, or None."""
if not shutil.which(binary):
return None
rc, out, err = _run([binary, "-c", "import sys; print(f'{sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}')"])
if rc == 0 and out:
return out
return None
def _has_pip_module(binary: str) -> bool:
"""True if ``<binary> -m pip --version`` succeeds."""
if not shutil.which(binary):
return False
rc, _out, _err = _run([binary, "-m", "pip", "--version"])
return rc == 0
def _detect_pep668(binary: str) -> bool:
"""True when ``<binary>``'s install location is PEP-668 externally-managed.
Looks for ``EXTERNALLY-MANAGED`` next to the stdlib (the marker file
Debian/Ubuntu drop in to gate naive ``pip install``).
"""
if not shutil.which(binary):
return False
code = (
"import sys, os;"
"stdlib = os.path.dirname(os.__file__);"
"marker = os.path.join(stdlib, 'EXTERNALLY-MANAGED');"
"print('yes' if os.path.exists(marker) else 'no')"
)
rc, out, _err = _run([binary, "-c", code])
return rc == 0 and out.strip() == "yes"
def _pip_python_version() -> Optional[str]:
"""If ``pip`` is on PATH, return the Python version it's bound to.
``pip --version`` output looks like::
pip 24.0 from /usr/lib/python3/dist-packages/pip (python 3.12)
Returns the parenthesised version (e.g. ``"3.12"``) or None.
"""
if not shutil.which("pip"):
return None
rc, out, _err = _run(["pip", "--version"])
if rc != 0 or not out:
return None
# Parse trailing "(python X.Y)".
if "(python " in out and out.endswith(")"):
try:
tail = out.rsplit("(python ", 1)[1]
return tail[:-1].strip()
except (IndexError, AttributeError):
return None
return None
def _build_probe_line() -> str:
"""Build the one-liner. Returns "" when nothing notable is detected.
Emit only when SOMETHING is off — the goal is to save the model from
hitting an avoidable wall, not to narrate a healthy environment.
"""
# Bail out if a remote terminal backend is configured; the host's
# Python state isn't where the agent's tools run.
backend = (os.getenv("TERMINAL_ENV") or "local").strip().lower()
if backend in _REMOTE_BACKENDS:
return ""
py3_ver = _python_version_of("python3")
py_ver = _python_version_of("python") # for systems with a `python` alias
py3_has_pip = _has_pip_module("python3") if py3_ver else False
pip_bound_to = _pip_python_version()
py3_pep668 = _detect_pep668("python3") if py3_ver else False
has_uv = shutil.which("uv") is not None
# If python3 exists, has pip, has uv (or no PEP 668), and there's no
# version mismatch between `pip` and `python3` → environment is
# clean enough to stay silent. The model can discover details by
# running commands if it cares.
mismatch = bool(pip_bound_to and py3_ver and not py3_ver.startswith(pip_bound_to))
silent_conditions = (
py3_ver is not None
and py3_has_pip
and not mismatch
and (not py3_pep668 or has_uv)
)
if silent_conditions:
return ""
# Build a compact factual summary. Keep it ONE line so it doesn't
# dominate the prompt; the model is good at parsing dense info.
bits: list[str] = []
if py3_ver:
py3_bit = f"python3={py3_ver}"
if not py3_has_pip:
py3_bit += " (no pip module)"
bits.append(py3_bit)
else:
bits.append("python3=missing")
if py_ver and py_ver != py3_ver:
bits.append(f"python={py_ver}")
elif not py_ver and py3_ver:
# Common on Debian/Ubuntu — call it out so the model doesn't
# type `python` and hit "command not found".
bits.append("python=missing (use python3)")
if pip_bound_to:
if mismatch:
bits.append(f"pip→python{pip_bound_to} (mismatch)")
elif not py3_has_pip:
# pip exists but `python3 -m pip` doesn't — the script
# works but the module path doesn't.
bits.append(f"pip→python{pip_bound_to}")
elif py3_has_pip:
# `pip` not on PATH but `python3 -m pip` works.
pass
else:
bits.append("pip=missing")
if py3_pep668:
bits.append("PEP 668=yes (use venv or uv)")
if has_uv:
bits.append("uv=installed")
if not bits:
return ""
return "Python toolchain: " + ", ".join(bits) + "."
def get_environment_probe_line(*, force_refresh: bool = False) -> str:
"""Return the cached probe line (building it on first call).
Returns "" when the environment is clean — the system prompt
assembler should drop the section in that case rather than
emit an empty heading.
``force_refresh`` is for tests; real callers should never need it.
"""
global _CACHED_LINE
if force_refresh:
with _CACHE_LOCK:
_CACHED_LINE = None
if _CACHED_LINE is not None:
return _CACHED_LINE
with _CACHE_LOCK:
if _CACHED_LINE is not None: # raced
return _CACHED_LINE
try:
line = _build_probe_line()
except Exception as exc: # never let probe failure block prompt build
logger.debug("env_probe failed: %s", exc)
line = ""
_CACHED_LINE = line
return line
def _reset_cache_for_tests() -> None:
"""Test helper — clear the cache between probe scenarios."""
global _CACHED_LINE
with _CACHE_LOCK:
_CACHED_LINE = None