Third Windows-specific sandbox bug (after WinError 10106 and the UTF-8
file-write bug): user scripts that print non-ASCII to stdout crash with
UnicodeEncodeError: 'charmap' codec can't encode character '\u2192'
in position N: character maps to <undefined>
Root cause: Python's sys.stdout on Windows is bound to the console code
page (cp1252 on US-locale installs) when the process is attached to a
pipe without PYTHONIOENCODING set. LLM-generated scripts routinely
print em-dashes, arrows, accented chars, and emoji — all of which cp1252
can't encode.
Fix: spawn the sandbox child with:
PYTHONIOENCODING=utf-8 # sys.stdin/stdout/stderr all UTF-8
PYTHONUTF8=1 # PEP 540 UTF-8 mode — open() defaults to UTF-8 too
PYTHONUTF8 is the belt-and-suspenders half: LLM scripts that call
open(path, 'w') without encoding= in user code will now produce UTF-8
files by default, matching what the sandbox already does for its own
staging files.
The parent side already decodes child stdout/stderr as UTF-8 with
errors='replace' (lines 1345-1347) so the end-to-end chain is clean.
On POSIX these values usually match the locale default already, so
setting them is harmless belt-and-suspenders for C/POSIX-locale
containers and minimal base images.
Tests added (4) — total file now at 28 passed, 1 skipped on Windows:
- test_popen_env_sets_pythonioencoding_utf8 (source grep)
- test_popen_env_sets_pythonutf8_mode (source grep)
- test_live_child_can_print_non_ascii (cross-platform live test)
- test_windows_child_without_utf8_env_would_fail (Windows negative
control — actually reproduces the bug without our env overrides,
proving the fix is load-bearing on this system)
Second Windows-specific sandbox bug (WinError 10106 was the first):
after the env-scrub fix let the child start, it immediately failed to
import hermes_tools with:
SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0x97
in position 154: invalid start byte
Root cause: _execute_local wrote the generated hermes_tools.py stub and
the user's script.py via open(path, 'w') without encoding=. On Windows
the default text-mode encoding is cp1252 (system locale), which encodes
em-dashes (used in the stub's docstrings) as 0x97. Python then decodes
source files as UTF-8 (PEP 3120) on import, chokes on 0x97, and the
sandbox dies before any tool call.
Fix: pass encoding='utf-8' to all four file opens in the code_execution
path — the two staging writes in _execute_local (hermes_tools.py +
script.py) and the two RPC file-transport reads/writes in the generated
remote stub. JSON is ASCII-safe for most payloads but tool results
(terminal output, web_extract content) routinely carry non-ASCII.
Tests added (4):
- test_stub_and_script_writes_specify_utf8 — source grep guard
- test_file_rpc_stub_uses_utf8 — generated remote stub check
- test_stub_source_roundtrips_through_utf8 — concrete round-trip
- test_windows_default_encoding_would_have_failed — negative control
(skips on modern Python builds where default is already UTF-8
compatible, but retained for platforms where the regression could
return)
24/25 tests pass on Windows 3.11 (negative control skips because this
Python build handles em-dashes via cp1252 subset — the fix is still
correct, just the corruption path isn't always triggerable).
Adds TestPosixEquivalence to test_code_execution_windows_env.py. The
class pins the invariant that _scrub_child_env(env, is_windows=False)
produces byte-for-byte identical output to the pre-refactor inline
scrubber, across a matrix of:
- 2 synthetic envs (POSIX-shaped, Windows-shaped-on-POSIX)
- 3 passthrough rules (none, single-var, everything)
- 1 real-os.environ check on whatever platform runs the test
Plus a superset sanity check: is_windows=True must keep everything
is_windows=False keeps, and any extras must come from the
_WINDOWS_ESSENTIAL_ENV_VARS allowlist.
Rationale: the previous commit refactored the env-scrubbing inline
block into a helper. Future changes to that helper must not silently
regress POSIX behavior — if someone needs to change it, they update
_legacy_posix_scrubber in lockstep so the churn is visible in review.
All 21 tests in the file pass locally on Windows (pytest 9.0.3). 8 of
them are parametrized equivalence checks that run on every OS.
The sandbox's env scrubbing was dropping SYSTEMROOT, WINDIR, COMSPEC,
APPDATA, etc. On Windows this broke the child process before any RPC
could happen:
OSError: [WinError 10106] The requested service provider could not
be loaded or initialized
Python's socket module uses SYSTEMROOT to locate mswsock.dll during
Winsock initialization. Without it, socket.socket(AF_INET, SOCK_STREAM)
fails — and the existing loopback-TCP fallback for Windows couldn't work.
Fix: add a small Windows-only allowlist (_WINDOWS_ESSENTIAL_ENV_VARS)
matched by exact uppercase name, after the existing secret-substring
block. The secret block still runs first, so the allowlist cannot be
used to exfiltrate credentials. Also extract the env scrubber into a
testable helper (_scrub_child_env) that takes is_windows as a parameter,
so the logic can be unit-tested on any OS.
Live Winsock smoke test verifies that a child spawned with the scrubbed
env can now create an AF_INET socket on a real Windows host; the test
is guarded by sys.platform == 'win32' so POSIX CI stays green.