test: use subprocesses for each test file (#29016)

* ci(tests): install ripgrep from prebuilt tarball instead of apt apt-get update + install of ripgrep takes ~4 min on the GHA Ubuntu runners (the apt-get update against archive.ubuntu.com is the slow part; ripgrep itself is small). Switching to the upstream musl binary tarball cuts the step to a few seconds. - Pinned to ripgrep 15.1.0 with sha256 verification (same hash as published in the releases sha256 sidecar file). - Drops the `rg` binary into /usr/local/bin so it is on PATH for every subsequent step without GITHUB_PATH manipulation. - Applied to both the test and e2e jobs in tests.yml. * fix(cli): compile syntax check to tempdir, not source __pycache__ `_validate_critical_files_syntax` runs `py_compile.compile()` on each critical bootstrap file after a successful `git pull`. The default `py_compile` writes the resulting `.pyc` next to the source under `__pycache__/`, which causes two real problems: 1. Parallel test workers walking the same source tree (e.g. running the suite under per-file process isolation) can race against each other on the `__pycache__` write — manifests as flaky 'directory not empty' errors during teardown. 2. In production, the post-pull syntax check leaves a `.pyc` behind that the next interpreter run might pick up — fine when the interpreter version matches, sketchy if it doesn't. Fix: write the compiled output to a `tempfile.TemporaryDirectory()` that's discarded on function exit. We only care about the compile-or-not signal, not the artifact. * test(runner): per-file process isolation, drop manual state reset + xdist Replace fragile manual _reset_module_state test fixtures with robust per-file subprocess isolation. Each test file runs in a fresh `python -m pytest <file>` subprocess via ThreadPoolExecutor. No xdist, no custom pytest plugin, no shared worker state. Key changes: * scripts/run_tests_parallel.py — new runner: discovers test files, runs N in parallel via ThreadPoolExecutor, captures stdout per file, treats exit code 5 (no tests collected) as pass, kills all children on exit. Change from cpu_count to cpu_count*2. The runner is I/O-bound (waiting on subprocess.communicate() from pytest children) The parent process does almost no CPU work, so 2x oversubscription keeps more pipes full. When a file fails, immediately show the last 30 lines of pytest output (stack traces + FAILED summary) plus a ready-to-copy repro command: python -m pytest tests/agent/test_auxiliary_client.py * scripts/run_tests.sh — delegates to run_tests_parallel.py * .github/workflows/tests.yml — test step: python scripts/run_tests_parallel.py * pyproject.toml — drop pytest-xdist, pytest-split; simplify addopts * tests/conftest.py — remove ~200 lines of manual state-reset fixtures * AGENTS.md — update Testing section for per-file design * test(runner): speed gateway test antipattern scan up * fix(test): web search provider plugin test missing xai * fix(tests): make 14 test files pass under per-file subprocess isolation Tests that relied on cross-file state pollution from xdist workers fail when run in isolation (per-file subprocess model). Root causes and fixes: Tool registry not populated: - test_video_generation_tool_surface_matrix: add discover_builtin_tools() - test_web_providers_brave_free/ddgs/searxng/general: autouse fixtures registering all 8 bundled web providers, reset after each test - test_website_policy: same provider registration pattern - test_web_tools_tavily: same pattern across 3 dispatch test classes - Also add is_safe_url/check_website_access mocks where SSRF check blocks example.com (DNS resolution fails in isolated envs) Stale check_fn cache: - test_kanban_tools: invalidate_check_fn_cache() + _clear_tool_defs_cache() in both kanban guidance tests (prior test cached False for kanban_show) - test_discord_tool: cache invalidation in setup/teardown - test_homeassistant_tool: invalidate_check_fn_cache() before registry queries Module-level state pollution: - test_auxiliary_client: autouse fixture clearing _aux_unhealthy_until cache - test_skill_commands: set_session_vars() instead of patch.dict(os.environ) (ContextVar takes precedence over os.environ) - test_dm_topics: overwrite sys.modules + separate telegram.constants mock + force-reimport of gateway.platforms.telegram - test_terminal_tool_requirements: removed duplicate class declaration, autouse _clear_caches fixture * change(tests): run_tests.sh explicitly includes env vars instead of manually dropping some vars, now we just only include some * fix(tests): 5 more isolation/NixOS fixes - test_approval_plugin_hooks: isolate HERMES_HOME so real user's command_allowlist doesn't short-circuit the approval path - test_google_chat: skipif when Platform.GOOGLE_CHAT not in enum (feature not merged on this branch) - test_write_deny: test systemd prefix against tmp_path instead of /etc/systemd which resolves to /nix/store on NixOS - test_pty_bridge: use shutil.which('cat') instead of /bin/cat (doesn't exist on NixOS) - profiles.py: rmtree onexc handler chmod's parent dirs too, fixing profile deletion when copytree preserved read-only modes from nix store * fix(tests): clear unhealthy cache in autouse fixture for auxiliary_client * fix(tests): skip send_message when telegram not installed; handle missing worker_id in browser_supervisor * fix: py3.11 rmtree onexc compat + belt-and-suspenders unhealthy cache clear for expired codex test * fix: address PR #29016 review feedback - Remove tracked .pytest-cache/ artifact and add to .gitignore - Fix stale 'xdist worker' comment in conftest.py - Deduplicate web provider registration into tests/tools/conftest.py shared helper (register_all_web_providers), replacing 8 copy-pasted blocks across 6 test files - Update PR description: remove stale recovered-test-files claim, fix worker count to match code (cpu_count*2) * fix: eliminate race in stale-cache achievements test The background scan thread could complete and overwrite _SNAPSHOT_CACHE before evaluate_all() returned the stale data — only 10 fake sessions made the scan finish instantly. Added scan_delay param to _FakeSessionDB and set it to 2s in the stale-cache test so the background thread can't win the race.
2026-06-07 08:02:23 +00:00 · 2026-05-21 07:10:04 -04:00 · 2026-05-21 07:10:04 -04:00 · 48be2e0e4d
commit 48be2e0e4d
parent 87d9239009
35 changed files with 1694 additions and 582 deletions
--- a/scripts/run_tests.sh
+++ b/scripts/run_tests.sh
@ -3,29 +3,36 @@
 # `pytest` directly to guarantee your local run matches CI behavior.
 #
 # What this script enforces:
-#   * -n 4 xdist workers (CI has 4 cores; -n auto diverges locally)
+#   * Per-file isolation via scripts/run_tests_parallel.py — each test
+#     file runs in its own freshly-spawned `python -m pytest <file>`
+#     subprocess. No xdist, no shared workers, no module-level leakage
+#     between files.
 #   * TZ=UTC, LANG=C.UTF-8, PYTHONHASHSEED=0 (deterministic)
-#   * Credential env vars blanked (conftest.py also does this, but this
-#     is belt-and-suspenders for anyone running `pytest` outside of
-#     our conftest path — e.g. calling pytest on a single file)
-#   * Proper venv activation
+#   * Env vars blanked (conftest.py also does this, but this
+#     is belt-and-suspenders for anyone running pytest outside our
+#     conftest path — e.g. on a single file)
+#   * Proper venv activation (probes .venv, venv, then ~/.hermes/...)
 #
 # Usage:
-#   scripts/run_tests.sh                     # full suite
-#   scripts/run_tests.sh tests/agent/        # one directory
-#   scripts/run_tests.sh tests/agent/test_foo.py::TestClass::test_method
-#   scripts/run_tests.sh --tb=long -v        # pass-through pytest args
+#   scripts/run_tests.sh                            # full suite
+#   scripts/run_tests.sh -j 4                       # cap parallelism
+#   scripts/run_tests.sh tests/agent/               # discover only here
+#   scripts/run_tests.sh tests/agent/ tests/acp/    # multiple roots
+#   scripts/run_tests.sh tests/foo.py               # single file
+#   scripts/run_tests.sh tests/foo.py -- --tb=long  # path + pytest args
+#   scripts/run_tests.sh -- -v --tb=long            # pytest args only
+#
+# Everything after a literal '--' is passed through to each per-file
+# pytest invocation. Positional path arguments before '--' override
+# the default discovery root (tests/).

 set -euo pipefail

 # ── Locate repo root ────────────────────────────────────────────────────────
-# Works whether this is the main checkout or a worktree.
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 REPO_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"

 # ── Activate venv ───────────────────────────────────────────────────────────
-# Prefer a .venv in the current tree, fall back to the main checkout's venv
-# (useful for worktrees where we don't always duplicate the venv).
 VENV=""
 for candidate in "$REPO_ROOT/.venv" "$REPO_ROOT/venv" "$HOME/.hermes/hermes-agent/venv"; do
  if [ -f "$candidate/bin/activate" ]; then
@ -41,94 +48,31 @@ fi

 PYTHON="$VENV/bin/python"

-# ── Ensure pytest-split is installed (required for shard-equivalent runs) ──
-if ! "$PYTHON" -c "import pytest_split" 2>/dev/null; then
-  echo "→ installing pytest-split into $VENV"
-  if command -v uv >/dev/null 2>&1; then
-    uv pip install --python "$PYTHON" --quiet "pytest-split>=0.9,<1"
-  elif "$PYTHON" -m pip --version >/dev/null 2>&1; then
-    "$PYTHON" -m pip install --quiet "pytest-split>=0.9,<1"
-  else
-    echo "error: neither uv nor pip is available in $VENV — pytest-split is missing" >&2
-    echo "  fix: run  uv pip install -e \".[dev]\"  from $REPO_ROOT" >&2
-    exit 1
-  fi
-fi

-# ── Hermetic environment ────────────────────────────────────────────────────
-# Mirror what CI does in .github/workflows/tests.yml + what conftest.py does.
-# Unset every credential-shaped var currently in the environment.
-while IFS='=' read -r name _; do
-  case "$name" in
-    *_API_KEY|*_TOKEN|*_SECRET|*_PASSWORD|*_CREDENTIALS|*_ACCESS_KEY| \
-    *_SECRET_ACCESS_KEY|*_PRIVATE_KEY|*_OAUTH_TOKEN|*_WEBHOOK_SECRET| \
-    *_ENCRYPT_KEY|*_APP_SECRET|*_CLIENT_SECRET|*_CORP_SECRET|*_AES_KEY| \
-    AWS_ACCESS_KEY_ID|AWS_SECRET_ACCESS_KEY|AWS_SESSION_TOKEN|FAL_KEY| \
-    GH_TOKEN|GITHUB_TOKEN)
-      unset "$name"
-      ;;
-  esac
-done < <(env)
-
-# Unset HERMES_* behavioral vars too.
-unset HERMES_YOLO_MODE HERMES_INTERACTIVE HERMES_QUIET HERMES_TOOL_PROGRESS \
-      HERMES_TOOL_PROGRESS_MODE HERMES_MAX_ITERATIONS HERMES_SESSION_PLATFORM \
-      HERMES_SESSION_CHAT_ID HERMES_SESSION_CHAT_NAME HERMES_SESSION_THREAD_ID \
-      HERMES_SESSION_SOURCE HERMES_SESSION_KEY HERMES_GATEWAY_SESSION \
-      HERMES_CRON_SESSION \
-      HERMES_PLATFORM HERMES_INFERENCE_PROVIDER HERMES_MANAGED HERMES_DEV \
-      HERMES_CONTAINER HERMES_EPHEMERAL_SYSTEM_PROMPT HERMES_TIMEZONE \
-      HERMES_REDACT_SECRETS HERMES_BACKGROUND_NOTIFICATIONS HERMES_EXEC_ASK \
-      HERMES_HOME_MODE 2>/dev/null || true
-
-# Pin deterministic runtime.
-export TZ=UTC
-export LANG=C.UTF-8
-export LC_ALL=C.UTF-8
-export PYTHONHASHSEED=0
-
-# ── Live-gateway test guard (developer machines) ────────────────────────────
-# If a system-wide hermes pytest_live_guard plugin is installed at
-# $HOME/.hermes/pytest_live_guard.py, force-load it here so every test run
-# from this script gets the protection regardless of which worktree is
-# checked out (in-tree tests/conftest.py guard may be missing on stale
-# branches). Harmless on CI / fresh machines that don't have the file.
+# ── Live-gateway plugin (computed before we drop env) ───────────────────────
+EXTRA_PYTHONPATH=""
+EXTRA_PYTEST_PLUGINS=""
 if [ -f "$HOME/.hermes/pytest_live_guard.py" ]; then
-  case ":${PYTHONPATH:-}:" in
-    *":$HOME/.hermes:"*) ;;
-    *) export PYTHONPATH="${PYTHONPATH:+$PYTHONPATH:}$HOME/.hermes" ;;
-  esac
-  if [[ ",${PYTEST_PLUGINS:-}," != *,pytest_live_guard,* ]]; then
-    export PYTEST_PLUGINS="${PYTEST_PLUGINS:+$PYTEST_PLUGINS,}pytest_live_guard"
-  fi
+  EXTRA_PYTHONPATH="$HOME/.hermes"
+  EXTRA_PYTEST_PLUGINS="pytest_live_guard"
 fi

-# ── Worker count ────────────────────────────────────────────────────────────
-# CI uses `-n auto` on ubuntu-latest which gives 4 workers. A 20-core
-# workstation with `-n auto` gets 20 workers and exposes test-ordering
-# flakes that CI will never see. Pin to 4 so local matches CI.
-WORKERS="${HERMES_TEST_WORKERS:-4}"

-# ── Run pytest ──────────────────────────────────────────────────────────────
+# ── Run in hermetic env ──────────────────────────────────────────────────────
+# env -i: start with empty environment, opt-in only what we need.
+# No credential var can leak — you'd have to explicitly add it here.
+echo "▶ running per-file parallel test suite via run_tests_parallel.py"
+echo "  (TZ=UTC LANG=C.UTF-8 PYTHONHASHSEED=0; clean env)"
+
 cd "$REPO_ROOT"

-# If the first argument starts with `-` treat all args as pytest flags;
-# otherwise treat them as test paths.
-ARGS=("$@")
-
-echo "▶ running pytest with $WORKERS workers, hermetic env, in $REPO_ROOT"
-echo "  (TZ=UTC LANG=C.UTF-8 PYTHONHASHSEED=0; all credential env vars unset)"
-
-# -o "addopts=" clears pyproject.toml's `-n auto` so our -n wins.
-# We re-add --timeout/--timeout-method here because pyproject.toml's
-# addopts is wiped above. The 60s cap is essential: see pyproject.toml
-# for why (suite deadlocks at session teardown without it).
-exec "$PYTHON" -m pytest \
-  -o "addopts=" \
-  -n "$WORKERS" \
-  --timeout=30 \
-  --timeout-method=signal \
-  --ignore=tests/integration \
-  --ignore=tests/e2e \
-  -m "not integration" \
-  "${ARGS[@]}"
+exec env -i \
+  PATH="$PATH" \
+  HOME="$HOME" \
+  TZ=UTC \
+  LANG=C.UTF-8 \
+  LC_ALL=C.UTF-8 \
+  PYTHONHASHSEED=0 \
+  ${EXTRA_PYTHONPATH:+PYTHONPATH="$EXTRA_PYTHONPATH"} \
+  ${EXTRA_PYTEST_PLUGINS:+PYTEST_PLUGINS="$EXTRA_PYTEST_PLUGINS"} \
+  "$PYTHON" "$SCRIPT_DIR/run_tests_parallel.py" "$@"