execute_code: set PYTHONIOENCODING=utf-8 + PYTHONUTF8=1 in child env

Third Windows-specific sandbox bug (after WinError 10106 and the UTF-8
file-write bug): user scripts that print non-ASCII to stdout crash with

    UnicodeEncodeError: 'charmap' codec can't encode character '\u2192'
                        in position N: character maps to <undefined>

Root cause: Python's sys.stdout on Windows is bound to the console code
page (cp1252 on US-locale installs) when the process is attached to a
pipe without PYTHONIOENCODING set.  LLM-generated scripts routinely
print em-dashes, arrows, accented chars, and emoji — all of which cp1252
can't encode.

Fix: spawn the sandbox child with:

    PYTHONIOENCODING=utf-8   # sys.stdin/stdout/stderr all UTF-8
    PYTHONUTF8=1             # PEP 540 UTF-8 mode — open() defaults to UTF-8 too

PYTHONUTF8 is the belt-and-suspenders half: LLM scripts that call
open(path, 'w') without encoding= in user code will now produce UTF-8
files by default, matching what the sandbox already does for its own
staging files.

The parent side already decodes child stdout/stderr as UTF-8 with
errors='replace' (lines 1345-1347) so the end-to-end chain is clean.

On POSIX these values usually match the locale default already, so
setting them is harmless belt-and-suspenders for C/POSIX-locale
containers and minimal base images.

Tests added (4) — total file now at 28 passed, 1 skipped on Windows:
  - test_popen_env_sets_pythonioencoding_utf8 (source grep)
  - test_popen_env_sets_pythonutf8_mode (source grep)
  - test_live_child_can_print_non_ascii (cross-platform live test)
  - test_windows_child_without_utf8_env_would_fail (Windows negative
    control — actually reproduces the bug without our env overrides,
    proving the fix is load-bearing on this system)
This commit is contained in:
Teknium 2026-05-07 18:59:35 -07:00
parent f5ec30dfe6
commit bf43f6cfdd
2 changed files with 152 additions and 0 deletions

View file

@ -1175,6 +1175,25 @@ def execute_code(
child_env = _scrub_child_env(os.environ)
child_env["HERMES_RPC_SOCKET"] = rpc_endpoint
child_env["PYTHONDONTWRITEBYTECODE"] = "1"
# Force UTF-8 for the child's stdio and default file encoding.
#
# Without this, on Windows sys.stdout is bound to the console code
# page (cp1252 on US-locale installs), and any script that does
# ``print("café")`` or ``print("→")`` crashes with:
#
# UnicodeEncodeError: 'charmap' codec can't encode character
# '\u2192' in position N: character maps to <undefined>
#
# PYTHONIOENCODING fixes sys.stdin/stdout/stderr.
# PYTHONUTF8=1 enables "UTF-8 mode" (PEP 540) which additionally
# makes ``open()``'s default encoding UTF-8, so user scripts that
# write files without specifying encoding= also work correctly.
#
# On POSIX both values usually match the locale default already,
# so setting them is harmless belt-and-suspenders for environments
# with a C/POSIX locale (containers, minimal base images).
child_env["PYTHONIOENCODING"] = "utf-8"
child_env["PYTHONUTF8"] = "1"
# Ensure the hermes-agent root is importable in the sandbox so
# repo-root modules are available to child scripts. We also prepend
# the staging tmpdir so ``from hermes_tools import ...`` resolves even