mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-05-11 03:31:55 +00:00
hermes_bootstrap: Windows-only UTF-8 stdio shim for all entry points
Codebase-wide fix for Python-on-Windows UTF-8 footguns, complementing
the earlier execute_code sandbox fixes (which remain load-bearing for
when the sandbox explicitly scrubs child env).
Problem: Python on Windows has two long-standing text-encoding pitfalls:
1. sys.stdout/stderr are bound to the console code page (cp1252 on
US-locale installs) — print('café') crashes with UnicodeEncodeError.
2. Subprocess children don't know to use UTF-8 unless PYTHONUTF8 and/or
PYTHONIOENCODING are set in their env — so any Python we spawn
(linters, sandbox children, delegation workers) hits the same bug.
Solution: A tiny bootstrap module (hermes_bootstrap.py) imported as the
first statement of every Hermes entry point:
- hermes_cli/main.py (hermes / hermes-agent console_script)
- run_agent.py (hermes-agent direct)
- acp_adapter/entry.py (hermes-acp)
- gateway/run.py (messaging gateway)
- batch_runner.py (parallel batch mode)
- cli.py (legacy direct-launch CLI)
On Windows, the bootstrap:
- os.environ.setdefault('PYTHONUTF8', '1') (PEP 540 UTF-8 mode)
- os.environ.setdefault('PYTHONIOENCODING', 'utf-8')
- sys.stdout/stderr/stdin.reconfigure(encoding='utf-8', errors='replace')
Children inherit the env vars → they run in UTF-8 mode.
Current process's stdio is reconfigured → print('café') works now.
On POSIX (Linux/macOS), the bootstrap is a complete no-op. We don't
touch LANG, LC_*, or anything else — users who have intentionally
configured a non-UTF-8 locale aren't affected. POSIX systems are
already UTF-8 by default in 99% of modern setups, so there's nothing
to fix.
setdefault() (not overwrite) means users who explicitly set PYTHONUTF8=0
or PYTHONIOENCODING=cp1252 in their environment are respected.
What this does NOT fix: bare open(path, 'w') calls in the *parent*
process still default to locale encoding because PYTHONUTF8 is only
read at interpreter init. A ruff PLW1514 sweep (separate follow-up)
will add explicit encoding='utf-8' at those ~219 call sites for
belt-and-suspenders.
Tests (17): 16 passed, 1 skipped on Windows.
- Windows: env vars set, stdio reconfigured, child inherits UTF-8 mode
- POSIX: complete no-op (verified on fake POSIX + skipped on real
POSIX since we don't have a Linux box in this session)
- Idempotence: multiple calls safe
- Graceful degradation: non-reconfigurable streams don't crash
- User opt-out: explicit PYTHONUTF8=0 is respected
- Load order: every entry point's FIRST top-level import is
hermes_bootstrap, enforced by an AST-level parametrized test
pyproject.toml: added hermes_bootstrap to py-modules so it ships with
pip installs.
This commit is contained in:
parent
bf43f6cfdd
commit
6098272454
9 changed files with 452 additions and 2 deletions
129
hermes_bootstrap.py
Normal file
129
hermes_bootstrap.py
Normal file
|
|
@ -0,0 +1,129 @@
|
|||
"""Windows UTF-8 bootstrap for Hermes entry points.
|
||||
|
||||
Python on Windows has two long-standing text-encoding footguns:
|
||||
|
||||
1. ``sys.stdout`` / ``sys.stderr`` are bound to the console code page
|
||||
(``cp1252`` on US-locale installs), so ``print("café")`` crashes with
|
||||
``UnicodeEncodeError: 'charmap' codec can't encode character``.
|
||||
|
||||
2. Child processes spawned via ``subprocess`` don't know to use UTF-8
|
||||
unless ``PYTHONUTF8`` and/or ``PYTHONIOENCODING`` are set in their
|
||||
environment — so any Python subprocess (the execute_code sandbox,
|
||||
delegation children, linter subprocesses, etc.) inherits the same
|
||||
cp1252 defaults and hits the same UnicodeEncodeError.
|
||||
|
||||
This module fixes both on Windows *only* — POSIX is untouched. It
|
||||
should be imported at the very top of every Hermes entry point
|
||||
(``hermes``, ``hermes-agent``, ``hermes-acp``, ``python -m gateway.run``,
|
||||
``batch_runner.py``, ``cron/scheduler.py``) before any other imports
|
||||
that might do file I/O or print to stdout.
|
||||
|
||||
What this module does on Windows:
|
||||
|
||||
- Sets ``os.environ["PYTHONUTF8"] = "1"`` (PEP 540 UTF-8 mode) so
|
||||
every child process we spawn uses UTF-8 for ``open()`` and stdio.
|
||||
- Sets ``os.environ["PYTHONIOENCODING"] = "utf-8"`` for belt-and-
|
||||
suspenders — some tools read this instead of / in addition to
|
||||
``PYTHONUTF8``.
|
||||
- Reconfigures ``sys.stdout`` / ``sys.stderr`` to UTF-8 in the current
|
||||
process, using the ``reconfigure()`` API (Python 3.7+). This fixes
|
||||
``print("café")`` in the parent without a re-exec.
|
||||
|
||||
What this module does NOT do:
|
||||
|
||||
- It does not re-exec Python with ``-X utf8``, so ``open()`` calls in
|
||||
the *current* process still default to locale encoding. Those need
|
||||
an explicit ``encoding="utf-8"`` at the call site (lint rule
|
||||
``PLW1514`` / ``PYI058``). Ruff is the right tool for that sweep.
|
||||
|
||||
What this module does on POSIX:
|
||||
|
||||
- Nothing. POSIX systems are already UTF-8 by default in 99% of cases,
|
||||
and we don't want to touch ``LANG``/``LC_*`` behavior that users may
|
||||
have configured intentionally. If someone hits a C/POSIX locale on
|
||||
Linux, they can export ``PYTHONUTF8=1`` themselves — we won't override.
|
||||
|
||||
Idempotent: safe to call multiple times. ``_bootstrap_once`` guards
|
||||
against double-reconfigure.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import os
|
||||
import sys
|
||||
|
||||
_IS_WINDOWS = sys.platform == "win32"
|
||||
_bootstrap_applied = False
|
||||
|
||||
|
||||
def apply_windows_utf8_bootstrap() -> bool:
|
||||
"""Apply the Windows UTF-8 bootstrap if we're on Windows.
|
||||
|
||||
Returns True if bootstrap was applied (i.e. we're on Windows and
|
||||
haven't already done this), False otherwise. The return value is
|
||||
advisory — callers normally don't need it, but tests may want to
|
||||
assert the path was taken.
|
||||
|
||||
Idempotent: subsequent calls after the first are a no-op.
|
||||
"""
|
||||
global _bootstrap_applied
|
||||
|
||||
if not _IS_WINDOWS:
|
||||
return False
|
||||
if _bootstrap_applied:
|
||||
return False
|
||||
|
||||
# 1. Child processes inherit these and run in UTF-8 mode.
|
||||
# We use setdefault() rather than overwriting so the user can
|
||||
# explicitly opt out by setting PYTHONUTF8=0 in their environment
|
||||
# (or PYTHONIOENCODING=something-else) if they really want to.
|
||||
os.environ.setdefault("PYTHONUTF8", "1")
|
||||
os.environ.setdefault("PYTHONIOENCODING", "utf-8")
|
||||
|
||||
# 2. Reconfigure the current process's stdio to UTF-8. Needed
|
||||
# because os.environ changes don't retroactively rebind sys.stdout
|
||||
# — those were bound at interpreter startup based on the console
|
||||
# code page. ``reconfigure`` is a TextIOWrapper method since 3.7.
|
||||
#
|
||||
# errors="replace" means that if we ever *read* something from
|
||||
# stdin that isn't UTF-8 (unlikely but possible with piped input
|
||||
# from legacy tools), we'll get U+FFFD replacement chars rather
|
||||
# than a crash. Output is pure UTF-8.
|
||||
for stream_name in ("stdout", "stderr"):
|
||||
stream = getattr(sys, stream_name, None)
|
||||
if stream is None:
|
||||
continue
|
||||
reconfigure = getattr(stream, "reconfigure", None)
|
||||
if reconfigure is None:
|
||||
# Not a TextIOWrapper (could be redirected to a BytesIO in
|
||||
# tests, or a non-standard stream in some embedded cases).
|
||||
# Skip silently — the env-var fix is still in effect for
|
||||
# child processes, which is the bigger win.
|
||||
continue
|
||||
try:
|
||||
reconfigure(encoding="utf-8", errors="replace")
|
||||
except (OSError, ValueError):
|
||||
# Already closed, or someone replaced it with something
|
||||
# non-reconfigurable. Non-fatal.
|
||||
pass
|
||||
|
||||
# stdin is reconfigured separately with errors="replace" too — input
|
||||
# from a legacy pipe shouldn't crash the process.
|
||||
stdin = getattr(sys, "stdin", None)
|
||||
if stdin is not None:
|
||||
reconfigure = getattr(stdin, "reconfigure", None)
|
||||
if reconfigure is not None:
|
||||
try:
|
||||
reconfigure(encoding="utf-8", errors="replace")
|
||||
except (OSError, ValueError):
|
||||
pass
|
||||
|
||||
_bootstrap_applied = True
|
||||
return True
|
||||
|
||||
|
||||
# Apply on import — entry points just need ``import hermes_bootstrap``
|
||||
# (or ``from hermes_bootstrap import apply_windows_utf8_bootstrap``) at
|
||||
# the very top of their module, before importing anything else. The
|
||||
# import side effect does the right thing.
|
||||
apply_windows_utf8_bootstrap()
|
||||
Loading…
Add table
Add a link
Reference in a new issue