mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-05-30 06:41:51 +00:00
* fix(codex): surface error code in Responses 'failed' status errors
When a Codex Responses turn ends with status=failed, the response carries
the failure details under `response.error` as
`{code, message, param, ...}`. The previous extractor pulled only
`message`, so users seeing a rate-limit failure got a bare "Slow down"
string indistinguishable from a generic stream truncation; an
internal_error with empty message degraded to a dict dump
("{'code': 'internal_error', 'message': ''}").
Extract a `_format_responses_error()` helper that:
- prefixes `code` when both code and message are present
  (e.g. 'rate_limit_exceeded: Slow down')
- falls back to the bare `code` when message is empty
- accepts both dict and attribute-style payloads (SDK and JSON-RPC paths)
- preserves the prior status-only fallback when no error payload exists
Apply the same helper at the sibling site in
`codex_app_server_session.run_turn()` so codex-CLI subprocess turn
failures get the same treatment.
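
The rules in the list above can be illustrated with a small sketch. This is not the helper from the commit: the function name `format_responses_error_sketch`, its signature, the wording of the status-only fallback, and the `str()` coercion of non-string fields are all assumptions made for illustration.

```python
from types import SimpleNamespace
from typing import Any


def _get(payload: Any, key: str) -> Any:
    """Read `key` from either a dict or an attribute-style object."""
    if isinstance(payload, dict):
        return payload.get(key)
    return getattr(payload, key, None)


def format_responses_error_sketch(error: Any, status: str = "failed") -> str:
    """Hypothetical formatter following the rules in the commit message."""
    # Assumption: the exact fallback wording is invented for this sketch.
    fallback = f"Responses turn ended with status={status}"
    if error is None:
        return fallback

    code = _get(error, "code")
    message = _get(error, "message")
    # Assumption: non-string fields are coerced with str(); None becomes "".
    code = str(code).strip() if code is not None else ""
    message = str(message).strip() if message is not None else ""

    if code and message:
        return f"{code}: {message}"  # e.g. 'rate_limit_exceeded: Slow down'
    if code or message:
        return code or message       # bare code, or bare message
    return fallback                  # payload present but empty


# Dict payload (JSON-RPC style) and attribute payload (SDK style).
print(format_responses_error_sketch({"code": "rate_limit_exceeded", "message": "Slow down"}))
print(format_responses_error_sketch(SimpleNamespace(code="internal_error", message="")))
```

Reading fields through one accessor keeps the dict and attribute paths on the same formatting logic, which is presumably why the commit applies a single helper at both call sites.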
Tests:
- 8 new unit tests for `_format_responses_error` covering both shapes,
  empty/missing fields, non-string fields, and the status-only fallback.
- 2 regression tests on `_normalize_codex_response` for failed status
  with and without a code, asserting the exact RuntimeError message.
- All 3603 tests in tests/agent/ pass.
Adapted from anomalyco/opencode#28757.
* feat(prompt): universal task-completion guidance + local Python toolchain probe
Two cross-model failure modes get a single-line answer in the cached
system prompt. Both gated by config (default on), both add zero overhead
when not needed, both verified via real AIAgent prompt builds.
## What changed
`TASK_COMPLETION_GUIDANCE` — short prompt block applied to ALL models.
Targets two failure modes observed on a real Sarasota real-estate build
task: (1) Opus stopped after writing an 85-byte stub and gave a prose
response with finish_reason=stop on call #3 of 90; (2) DeepSeek pushed
through a PEP-668 wall, then returned fabricated listings instead of
admitting the blocker. Both behaviors are model-family-agnostic, so the
guidance lives outside the existing tool_use_enforcement gate (~192
tokens, paid once per session via prefix cache).
`tools/env_probe.py` — local Python toolchain probe. Detects
python3/pip/uv/PEP-668 state and emits ONE short line in the system
prompt when something is non-default. Emits NOTHING when the env is
clean (zero token cost for normal users). Skipped entirely for remote
terminal backends (docker/modal/ssh) — they have their own probe.
Example output on a broken environment (the actual case):
    Python toolchain: python3=3.11.15 (no pip module),
    python=missing (use python3), pip→python3.12 (mismatch),
    PEP 668=yes (use venv or uv).
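
Because the probe returns either one line or an empty string, a prompt assembler can treat it as an optional section. The sketch below shows one way a caller might do that. `assemble_prompt_sketch`, its parameters, and the section separator are hypothetical and are not taken from `agent/prompt_builder.py`.

```python
from typing import Callable, List


def assemble_prompt_sketch(
    base_sections: List[str],
    probe: Callable[[], str],
    probe_enabled: bool = True,
) -> str:
    """Join prompt sections, adding the probe line only when it is non-empty."""
    sections = list(base_sections)
    if probe_enabled:
        line = probe()
        # An empty string means "environment looks clean": add nothing,
        # so a healthy environment costs no extra prompt tokens.
        if line:
            sections.append(line)
    return "\n\n".join(sections)


broken = (
    "Python toolchain: python3=3.11.15 (no pip module), "
    "python=missing (use python3), pip→python3.12 (mismatch), "
    "PEP 668=yes (use venv or uv)."
)
print(assemble_prompt_sketch(["You are Hermes."], lambda: broken))
print(assemble_prompt_sketch(["You are Hermes."], lambda: ""))
```

In the real module the callable would presumably be `get_environment_probe_line`, and the flag would come from `agent.environment_probe`.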
## Config
Both flags live under `agent.` in config.yaml, default True:
    agent:
      task_completion_guidance: true   # universal "finish the job" block
      environment_probe: true          # local Python toolchain hints
Neither addition required a `_config_version` bump — deep-merge fills
defaults in for existing user configs.
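
To illustrate why no version bump is needed, here is a minimal sketch of a recursive deep merge that fills missing keys from defaults. Hermes's actual merge routine is not shown in this commit, so `deep_merge_sketch`, the `DEFAULTS` dict, and the `model` key in the example are assumptions.

```python
from copy import deepcopy
from typing import Any, Dict

# Assumed default values, mirroring the two flags described above.
DEFAULTS: Dict[str, Any] = {
    "agent": {
        "task_completion_guidance": True,
        "environment_probe": True,
    }
}


def deep_merge_sketch(defaults: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return defaults overlaid with overrides, recursing into nested dicts."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge_sketch(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# An older user config that predates both flags and sets an unrelated key.
user_config = {"agent": {"model": "example-model"}}
print(deep_merge_sketch(DEFAULTS, user_config))
```

A user who sets `environment_probe: false` keeps that value, while any flag the config omits picks up its default.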
## Validation
| Test surface | Result |
|---|---|
| tests/tools/test_env_probe.py | 10/10 pass (probe unit) |
| tests/run_agent/test_run_agent.py — new classes | 8/8 pass (integration) |
| TestToolUseEnforcementConfig | 17/17 pass (no regression) |
| TestBuildSystemPrompt | 9/9 pass (no regression) |
| TestInvalidateSystemPrompt | 2/2 pass (no regression) |
| tests/agent/test_prompt_builder.py | 124/124 pass (no regression) |
| tests/hermes_cli/ | 5662/5662 pass (config defaults) |
| E2E AIAgent build (broken env) | Both blocks present, 2,178 chars |
| E2E AIAgent build (clean env) | 771-char net overhead, env probe silent |
247 lines
8.4 KiB
Python
"""Local-environment toolchain probe for the system prompt.

When the terminal backend is local (the agent's tools run on the same
machine as Hermes itself), we surface a single deterministic line about
Python tooling state so models don't have to discover it by hitting
walls.  Common failure modes this addresses:

* Hermes ships under one Python (e.g. 3.11 in a bundled venv) while the
  user's login shell has a different one (e.g. 3.12 system).  ``pip``
  resolved from PATH may not match ``python3 -m pip``.
* The bundled-venv Python has no pip module installed → ``python3 -m
  pip`` returns ``No module named pip``.
* The system Python is PEP-668 externally-managed → naive
  ``pip install`` fails with ``error: externally-managed-environment``.

The probe is cheap (a handful of subprocess calls, ~50ms total),
cached for the lifetime of the process, and emits **at most one
short line** when something non-default is detected.  When the
environment looks normal (python3+pip both present and matched, no
PEP 668), it emits nothing: no token cost.

Remote terminal backends (docker, modal, ssh, …) are skipped: the
host's Python state is irrelevant when tools run inside a sandbox.
The sandbox has its own existing probe (``_probe_remote_backend``)
in ``agent/prompt_builder.py``.

Toggle via ``agent.environment_probe`` in config.yaml (default True).
"""

from __future__ import annotations

import logging
import os
import shutil
import subprocess
import sys
import threading
from typing import Optional

logger = logging.getLogger(__name__)

# Module-level cache.  The probe result is deterministic for the
# lifetime of the process; Python install state doesn't change
# mid-session in any way that would matter for the system prompt.
_CACHE_LOCK = threading.Lock()
_CACHED_LINE: Optional[str] = None  # None = not probed yet; "" = probed, nothing to say.

# Remote backends: keep in sync with agent/prompt_builder.py:_REMOTE_TERMINAL_BACKENDS.
# Duplicated rather than imported to avoid a circular import (prompt_builder
# imports nothing from tools).
_REMOTE_BACKENDS = frozenset({
    "docker", "singularity", "modal", "daytona", "ssh", "managed_modal",
})


def _run(cmd: list[str], timeout: float = 3.0) -> tuple[int, str, str]:
    """Run a short subprocess.  Returns (returncode, stdout, stderr).

    Failures (binary missing, timeout, OSError) return (-1, "", "<reason>").
    """
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,
        )
        return result.returncode, (result.stdout or "").strip(), (result.stderr or "").strip()
    except FileNotFoundError:
        return -1, "", "not found"
    except subprocess.TimeoutExpired:
        return -1, "", "timeout"
    except OSError as exc:
        return -1, "", f"oserror: {exc}"


def _python_version_of(binary: str) -> Optional[str]:
    """Return a short version string like ``3.12.4`` for ``binary``, or None."""
    if not shutil.which(binary):
        return None
    rc, out, err = _run([binary, "-c", "import sys; print(f'{sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}')"])
    if rc == 0 and out:
        return out
    return None


def _has_pip_module(binary: str) -> bool:
    """True if ``<binary> -m pip --version`` succeeds."""
    if not shutil.which(binary):
        return False
    rc, _out, _err = _run([binary, "-m", "pip", "--version"])
    return rc == 0


def _detect_pep668(binary: str) -> bool:
    """True when ``<binary>``'s install location is PEP-668 externally-managed.

    Looks for ``EXTERNALLY-MANAGED`` next to the stdlib (the marker file
    Debian/Ubuntu drop in to gate naive ``pip install``).
    """
    if not shutil.which(binary):
        return False
    code = (
        "import sys, os;"
        "stdlib = os.path.dirname(os.__file__);"
        "marker = os.path.join(stdlib, 'EXTERNALLY-MANAGED');"
        "print('yes' if os.path.exists(marker) else 'no')"
    )
    rc, out, _err = _run([binary, "-c", code])
    return rc == 0 and out.strip() == "yes"


def _pip_python_version() -> Optional[str]:
    """If ``pip`` is on PATH, return the Python version it's bound to.

    ``pip --version`` output looks like::

        pip 24.0 from /usr/lib/python3/dist-packages/pip (python 3.12)

    Returns the parenthesised version (e.g. ``"3.12"``) or None.
    """
    if not shutil.which("pip"):
        return None
    rc, out, _err = _run(["pip", "--version"])
    if rc != 0 or not out:
        return None
    # Parse trailing "(python X.Y)".
    if "(python " in out and out.endswith(")"):
        try:
            tail = out.rsplit("(python ", 1)[1]
            return tail[:-1].strip()
        except (IndexError, AttributeError):
            return None
    return None


def _build_probe_line() -> str:
    """Build the one-liner.  Returns "" when nothing notable is detected.

    Emit only when SOMETHING is off: the goal is to save the model from
    hitting an avoidable wall, not to narrate a healthy environment.
    """
    # Bail out if a remote terminal backend is configured; the host's
    # Python state isn't where the agent's tools run.
    backend = (os.getenv("TERMINAL_ENV") or "local").strip().lower()
    if backend in _REMOTE_BACKENDS:
        return ""

    py3_ver = _python_version_of("python3")
    py_ver = _python_version_of("python")  # for systems with a `python` alias
    py3_has_pip = _has_pip_module("python3") if py3_ver else False
    pip_bound_to = _pip_python_version()
    py3_pep668 = _detect_pep668("python3") if py3_ver else False
    has_uv = shutil.which("uv") is not None

    # If python3 exists, has pip, has uv (or no PEP 668), and there's no
    # version mismatch between `pip` and `python3` → environment is
    # clean enough to stay silent.  The model can discover details by
    # running commands if it cares.
    mismatch = bool(pip_bound_to and py3_ver and not py3_ver.startswith(pip_bound_to))
    silent_conditions = (
        py3_ver is not None
        and py3_has_pip
        and not mismatch
        and (not py3_pep668 or has_uv)
    )
    if silent_conditions:
        return ""

    # Build a compact factual summary.  Keep it ONE line so it doesn't
    # dominate the prompt; the model is good at parsing dense info.
    bits: list[str] = []
    if py3_ver:
        py3_bit = f"python3={py3_ver}"
        if not py3_has_pip:
            py3_bit += " (no pip module)"
        bits.append(py3_bit)
    else:
        bits.append("python3=missing")

    if py_ver and py_ver != py3_ver:
        bits.append(f"python={py_ver}")
    elif not py_ver and py3_ver:
        # Common on Debian/Ubuntu: call it out so the model doesn't
        # type `python` and hit "command not found".
        bits.append("python=missing (use python3)")

    if pip_bound_to:
        if mismatch:
            bits.append(f"pip→python{pip_bound_to} (mismatch)")
        elif not py3_has_pip:
            # pip exists but `python3 -m pip` doesn't: the script
            # works but the module path doesn't.
            bits.append(f"pip→python{pip_bound_to}")
    elif py3_has_pip:
        # `pip` not on PATH but `python3 -m pip` works.
        pass
    else:
        bits.append("pip=missing")

    if py3_pep668:
        bits.append("PEP 668=yes (use venv or uv)")

    if has_uv:
        bits.append("uv=installed")

    if not bits:
        return ""

    return "Python toolchain: " + ", ".join(bits) + "."


def get_environment_probe_line(*, force_refresh: bool = False) -> str:
    """Return the cached probe line (building it on first call).

    Returns "" when the environment is clean; the system prompt
    assembler should drop the section in that case rather than
    emit an empty heading.

    ``force_refresh`` is for tests; real callers should never need it.
    """
    global _CACHED_LINE
    if force_refresh:
        with _CACHE_LOCK:
            _CACHED_LINE = None

    if _CACHED_LINE is not None:
        return _CACHED_LINE

    with _CACHE_LOCK:
        if _CACHED_LINE is not None:  # raced
            return _CACHED_LINE
        try:
            line = _build_probe_line()
        except Exception as exc:  # never let probe failure block prompt build
            logger.debug("env_probe failed: %s", exc)
            line = ""
        _CACHED_LINE = line
        return line


def _reset_cache_for_tests() -> None:
    """Test helper: clear the cache between probe scenarios."""
    global _CACHED_LINE
    with _CACHE_LOCK:
        _CACHED_LINE = None