test_code_execution_modes.py had two test-level failures and two
class-level stale skip reasons on this Windows-native branch:
- TestResolveChildPython::test_project_with_virtualenv_picks_venv_python
- TestResolveChildPython::test_project_prefers_virtualenv_over_conda
Both fail on Windows with OSError: [WinError 1314] — they call
pathlib.Path.symlink_to() to build a fake venv, which requires
developer mode or admin on Windows. They also assume POSIX venv
layout (bin/python) where Windows uses Scripts/python.exe. Skip
them with a specific, accurate reason.
Also updated two class-level skipif reasons that said
'execute_code is POSIX-only' — no longer true on this branch.
New reason explains it's the test infrastructure (symlinks + POSIX
venv layout) that's the blocker, not execute_code itself.
Results on Windows Python 3.11:
Before: 41 passed, 10 skipped, 2 failed
After: 43 passed, 12 skipped, 0 failed
Weaker models (Gemma-class) repeatedly rediscover and forget that
execute_code uses a different CWD and Python interpreter than terminal(),
causing them to flip-flop on whether user files exist and to hit import
errors on project dependencies like pandas.
Adds a new 'code_execution.mode' config key (default 'project') that
brings execute_code into line with terminal()'s filesystem/interpreter:
project (new default):
- cwd = session's TERMINAL_CWD (falls back to os.getcwd())
- python = active VIRTUAL_ENV/bin/python or CONDA_PREFIX/bin/python
with a Python 3.8+ version check; falls back cleanly to
sys.executable if no venv or the candidate fails
- result : 'import pandas' works, '.env' resolves, matches terminal()
strict (opt-in):
- cwd = staging tmpdir (today's behavior)
- python = sys.executable (today's behavior)
- result : maximum reproducibility and isolation; project deps
won't resolve
Security-critical invariants are identical across both modes and covered by
explicit regression tests:
- env scrubbing (strips *_API_KEY, *_TOKEN, *_SECRET, *_PASSWORD,
*_CREDENTIAL, *_PASSWD, *_AUTH substrings)
- SANDBOX_ALLOWED_TOOLS whitelist (no execute_code recursion, no
delegate_task, no MCP from inside scripts)
- resource caps (5-min timeout, 50KB stdout, 50 tool calls)
Deliberately avoids 'sandbox'/'isolated'/'cloud' language in tool
descriptions (regression from commit 39b83f34 where agents on local
backends falsely believed they were sandboxed and refused networking).
Override via env var: HERMES_EXECUTE_CODE_MODE=strict|project