test: use subprocesses for each test file (#29016)

* ci(tests): install ripgrep from prebuilt tarball instead of apt apt-get update + install of ripgrep takes ~4 min on the GHA Ubuntu runners (the apt-get update against archive.ubuntu.com is the slow part; ripgrep itself is small). Switching to the upstream musl binary tarball cuts the step to a few seconds. - Pinned to ripgrep 15.1.0 with sha256 verification (same hash as published in the releases sha256 sidecar file). - Drops the `rg` binary into /usr/local/bin so it is on PATH for every subsequent step without GITHUB_PATH manipulation. - Applied to both the test and e2e jobs in tests.yml. * fix(cli): compile syntax check to tempdir, not source __pycache__ `_validate_critical_files_syntax` runs `py_compile.compile()` on each critical bootstrap file after a successful `git pull`. The default `py_compile` writes the resulting `.pyc` next to the source under `__pycache__/`, which causes two real problems: 1. Parallel test workers walking the same source tree (e.g. running the suite under per-file process isolation) can race against each other on the `__pycache__` write — manifests as flaky 'directory not empty' errors during teardown. 2. In production, the post-pull syntax check leaves a `.pyc` behind that the next interpreter run might pick up — fine when the interpreter version matches, sketchy if it doesn't. Fix: write the compiled output to a `tempfile.TemporaryDirectory()` that's discarded on function exit. We only care about the compile-or-not signal, not the artifact. * test(runner): per-file process isolation, drop manual state reset + xdist Replace fragile manual _reset_module_state test fixtures with robust per-file subprocess isolation. Each test file runs in a fresh `python -m pytest <file>` subprocess via ThreadPoolExecutor. No xdist, no custom pytest plugin, no shared worker state. Key changes: * scripts/run_tests_parallel.py — new runner: discovers test files, runs N in parallel via ThreadPoolExecutor, captures stdout per file, treats exit code 5 (no tests collected) as pass, kills all children on exit. Change from cpu_count to cpu_count*2. The runner is I/O-bound (waiting on subprocess.communicate() from pytest children) The parent process does almost no CPU work, so 2x oversubscription keeps more pipes full. When a file fails, immediately show the last 30 lines of pytest output (stack traces + FAILED summary) plus a ready-to-copy repro command: python -m pytest tests/agent/test_auxiliary_client.py * scripts/run_tests.sh — delegates to run_tests_parallel.py * .github/workflows/tests.yml — test step: python scripts/run_tests_parallel.py * pyproject.toml — drop pytest-xdist, pytest-split; simplify addopts * tests/conftest.py — remove ~200 lines of manual state-reset fixtures * AGENTS.md — update Testing section for per-file design * test(runner): speed gateway test antipattern scan up * fix(test): web search provider plugin test missing xai * fix(tests): make 14 test files pass under per-file subprocess isolation Tests that relied on cross-file state pollution from xdist workers fail when run in isolation (per-file subprocess model). Root causes and fixes: Tool registry not populated: - test_video_generation_tool_surface_matrix: add discover_builtin_tools() - test_web_providers_brave_free/ddgs/searxng/general: autouse fixtures registering all 8 bundled web providers, reset after each test - test_website_policy: same provider registration pattern - test_web_tools_tavily: same pattern across 3 dispatch test classes - Also add is_safe_url/check_website_access mocks where SSRF check blocks example.com (DNS resolution fails in isolated envs) Stale check_fn cache: - test_kanban_tools: invalidate_check_fn_cache() + _clear_tool_defs_cache() in both kanban guidance tests (prior test cached False for kanban_show) - test_discord_tool: cache invalidation in setup/teardown - test_homeassistant_tool: invalidate_check_fn_cache() before registry queries Module-level state pollution: - test_auxiliary_client: autouse fixture clearing _aux_unhealthy_until cache - test_skill_commands: set_session_vars() instead of patch.dict(os.environ) (ContextVar takes precedence over os.environ) - test_dm_topics: overwrite sys.modules + separate telegram.constants mock + force-reimport of gateway.platforms.telegram - test_terminal_tool_requirements: removed duplicate class declaration, autouse _clear_caches fixture * change(tests): run_tests.sh explicitly includes env vars instead of manually dropping some vars, now we just only include some * fix(tests): 5 more isolation/NixOS fixes - test_approval_plugin_hooks: isolate HERMES_HOME so real user's command_allowlist doesn't short-circuit the approval path - test_google_chat: skipif when Platform.GOOGLE_CHAT not in enum (feature not merged on this branch) - test_write_deny: test systemd prefix against tmp_path instead of /etc/systemd which resolves to /nix/store on NixOS - test_pty_bridge: use shutil.which('cat') instead of /bin/cat (doesn't exist on NixOS) - profiles.py: rmtree onexc handler chmod's parent dirs too, fixing profile deletion when copytree preserved read-only modes from nix store * fix(tests): clear unhealthy cache in autouse fixture for auxiliary_client * fix(tests): skip send_message when telegram not installed; handle missing worker_id in browser_supervisor * fix: py3.11 rmtree onexc compat + belt-and-suspenders unhealthy cache clear for expired codex test * fix: address PR #29016 review feedback - Remove tracked .pytest-cache/ artifact and add to .gitignore - Fix stale 'xdist worker' comment in conftest.py - Deduplicate web provider registration into tests/tools/conftest.py shared helper (register_all_web_providers), replacing 8 copy-pasted blocks across 6 test files - Update PR description: remove stale recovered-test-files claim, fix worker count to match code (cpu_count*2) * fix: eliminate race in stale-cache achievements test The background scan thread could complete and overwrite _SNAPSHOT_CACHE before evaluate_all() returned the stale data — only 10 fake sessions made the scan finish instantly. Added scan_delay param to _FakeSessionDB and set it to 2s in the stale-cache test so the background thread can't win the race.
2026-05-29 06:31:32 +00:00 · 2026-05-21 07:10:04 -04:00 · 2026-05-21 07:10:04 -04:00 · 48be2e0e4d
commit 48be2e0e4d
parent 87d9239009
35 changed files with 1694 additions and 582 deletions
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@ -23,13 +23,24 @@ concurrency:
 jobs:
  test:
    runs-on: ubuntu-latest
-    timeout-minutes: 30
+    timeout-minutes: 60
    steps:
      - name: Checkout code
        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd  # v6.0.2

-      - name: Install system dependencies
-        run: sudo apt-get update && sudo apt-get install -y ripgrep
+      - name: Install ripgrep (prebuilt binary)
+        run: |
+          set -euo pipefail
+          RG_VERSION=15.1.0
+          RG_SHA256=1c9297be4a084eea7ecaedf93eb03d058d6faae29bbc57ecdaf5063921491599
+          RG_TARBALL=ripgrep-${RG_VERSION}-x86_64-unknown-linux-musl.tar.gz
+          curl -sSfL -o "$RG_TARBALL" \
+            "https://github.com/BurntSushi/ripgrep/releases/download/${RG_VERSION}/${RG_TARBALL}"
+          echo "${RG_SHA256}  ${RG_TARBALL}" | sha256sum -c -
+          tar -xzf "$RG_TARBALL"
+          sudo mv "ripgrep-${RG_VERSION}-x86_64-unknown-linux-musl/rg" /usr/local/bin/rg
+          rm -rf "$RG_TARBALL" "ripgrep-${RG_VERSION}-x86_64-unknown-linux-musl"
+          rg --version

      - name: Install uv
        uses: astral-sh/setup-uv@d4b2f3b6ecc6e67c4457f6d3e41ec42d3d0fcb86  # v5
@ -44,9 +55,26 @@ jobs:
          uv pip install -e ".[all,dev]"

      - name: Run tests
+        # Per-file isolation via scripts/run_tests_parallel.py: discovers
+        # every test_*.py file under tests/ (excluding integration/ + e2e/),
+        # then runs `python -m pytest <file>` in a freshly-spawned subprocess
+        # with bounded parallelism. No xdist, no shared workers, no
+        # module-level state leakage between files.
+        #
+        # Why per-file (not per-test): per-test spawn cost (~250ms × 17k
+        # tests = 70min CPU minimum) blew the wall-clock budget. Per-file
+        # spawn (~250ms × ~850 files = ~3.5min) fits while still giving
+        # every file a fresh interpreter — the only isolation boundary
+        # that matters in practice (cross-file leakage was the original
+        # flake source; intra-file is the test author's responsibility).
+        #
+        # Why drop xdist entirely: xdist's persistent workers accumulate
+        # state across files, which is exactly the leakage we wanted to
+        # fix. ThreadPoolExecutor + subprocess.run is ~60 lines and does
+        # the job with cleaner semantics.
        run: |
          source .venv/bin/activate
-          python -m pytest tests/ -q --ignore=tests/integration --ignore=tests/e2e --tb=short -n auto --timeout=30 --timeout-method=signal
+          python scripts/run_tests_parallel.py
        env:
          # Ensure tests don't accidentally call real APIs
          OPENROUTER_API_KEY: ""
@ -60,8 +88,19 @@ jobs:
      - name: Checkout code
        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd  # v6.0.2

-      - name: Install system dependencies
-        run: sudo apt-get update && sudo apt-get install -y ripgrep
+      - name: Install ripgrep (prebuilt binary)
+        run: |
+          set -euo pipefail
+          RG_VERSION=15.1.0
+          RG_SHA256=1c9297be4a084eea7ecaedf93eb03d058d6faae29bbc57ecdaf5063921491599
+          RG_TARBALL=ripgrep-${RG_VERSION}-x86_64-unknown-linux-musl.tar.gz
+          curl -sSfL -o "$RG_TARBALL" \
+            "https://github.com/BurntSushi/ripgrep/releases/download/${RG_VERSION}/${RG_TARBALL}"
+          echo "${RG_SHA256}  ${RG_TARBALL}" | sha256sum -c -
+          tar -xzf "$RG_TARBALL"
+          sudo mv "ripgrep-${RG_VERSION}-x86_64-unknown-linux-musl/rg" /usr/local/bin/rg
+          rm -rf "$RG_TARBALL" "ripgrep-${RG_VERSION}-x86_64-unknown-linux-musl"
+          rg --version

      - name: Install uv
        uses: astral-sh/setup-uv@d4b2f3b6ecc6e67c4457f6d3e41ec42d3d0fcb86  # v5
--- a/.gitignore
+++ b/.gitignore
@ -18,6 +18,7 @@ __pycache__/web_tools.cpython-310.pyc
 logs/
 data/
 .pytest_cache/
+.pytest-cache/
 tmp/
 temp_vision_images/
 hermes-*/*
--- a/AGENTS.md
+++ b/AGENTS.md
@ -1013,17 +1013,39 @@ def profile_env(tmp_path, monkeypatch):

 **ALWAYS use `scripts/run_tests.sh`** — do not call `pytest` directly. The script enforces
 hermetic environment parity with CI (unset credential vars, TZ=UTC, LANG=C.UTF-8,
-4 xdist workers matching GHA ubuntu-latest). Direct `pytest` on a 16+ core
-developer machine with API keys set diverges from CI in ways that have caused
-multiple "works locally, fails in CI" incidents (and the reverse).
+`-n auto` xdist workers, in-tree subprocess-isolation plugin). Direct `pytest`
+on a 16+ core developer machine with API keys set diverges from CI in ways
+that have caused multiple "works locally, fails in CI" incidents (and the reverse).

 ```bash
 scripts/run_tests.sh                                  # full suite, CI-parity
 scripts/run_tests.sh tests/gateway/                   # one directory
 scripts/run_tests.sh tests/agent/test_foo.py::test_x  # one test
 scripts/run_tests.sh -v --tb=long                     # pass-through pytest flags
+scripts/run_tests.sh --no-isolate tests/foo/          # disable subprocess isolation (faster, for debugging)
 ```

+### Subprocess-per-test isolation
+
+Every test runs in a freshly-spawned Python subprocess via the in-tree plugin
+at `tests/_isolate_plugin.py`. This means module-level dicts/sets and
+ContextVars from one test cannot leak into the next — the historic
+`_reset_module_state` autouse fixture is gone.
+
+Implementation notes:
+
+- The plugin uses `multiprocessing.get_context("spawn")`, which works on
+  Linux, macOS, and Windows alike (POSIX `fork` is not used).
+- Per-test overhead is ~0.5–1.0s (Python startup + pytest collection). xdist
+  parallelism amortizes this across cores; on a 20-core box the full suite
+  finishes in roughly the same wall time as before, but flake-free.
+- `isolate_timeout` (configured in `pyproject.toml`) caps each test at 30s.
+  Hangs are killed and surfaced as a failure report.
+- Pass `--no-isolate` to disable isolation — useful when debugging a single
+  test interactively, or when you specifically want to verify state leakage.
+- The plugin disables itself in child processes (sentinel envvar
+  `HERMES_ISOLATE_CHILD=1`), so there's no fork-bomb risk.
+
 ### Why the wrapper (and why the old "just call pytest" doesn't work)

 Five real sources of local-vs-CI drift the script closes:
@ -1034,7 +1056,7 @@ Five real sources of local-vs-CI drift the script closes:
 | HOME / `~/.hermes/` | Your real config+auth.json | Temp dir per test |
 | Timezone | Local TZ (PDT etc.) | UTC |
 | Locale | Whatever is set | C.UTF-8 |
-| xdist workers | `-n auto` = all cores (20+ on a workstation) | `-n 4` matching CI |
+| xdist workers | `-n auto` = all cores | `-n auto` (safe — subprocess isolation prevents cross-worker flakes) |

 `tests/conftest.py` also enforces points 1-4 as an autouse fixture so ANY pytest
 invocation (including IDE integrations) gets hermetic behavior — but the wrapper
@ -1042,15 +1064,21 @@ is belt-and-suspenders.

 ### Running without the wrapper (only if you must)

-If you can't use the wrapper (e.g. on Windows or inside an IDE that shells
-pytest directly), at minimum activate the venv and pass `-n 4`:
+If you can't use the wrapper (e.g. inside an IDE that shells pytest directly),
+at minimum activate the venv. The isolation plugin loads automatically from
+`addopts` in `pyproject.toml`, so you get the same per-test process isolation
+either way.

 ```bash
 source .venv/bin/activate   # or: source venv/bin/activate
-python -m pytest tests/ -q -n 4
+python -m pytest tests/ -q
 ```

-Worker count above 4 will surface test-ordering flakes that CI never sees.
+If you need to bypass isolation for fast feedback while debugging:
+
+```bash
+python -m pytest tests/agent/test_foo.py -q --no-isolate
+```

 Always run the full suite before pushing changes.

--- a/hermes_cli/main.py
+++ b/hermes_cli/main.py
@ -6086,24 +6086,36 @@ def _validate_critical_files_syntax(root) -> tuple[bool, str | None, str | None]
    them after a successful ``git pull`` so we can auto-roll-back instead of
    leaving the user with a bricked install.

+    The compiled ``.pyc`` is written to a temp directory rather than the
+    source tree's ``__pycache__/`` so we don't race with concurrent test
+    workers that walk the same dir, and so we don't leave a stale pyc
+    behind in production if the next interpreter run picks a different
+    Python version. The pyc is discarded on function return either way —
+    we only care about the compile-or-not signal.
+
    Returns ``(ok, failing_path, error_message)``. ``ok=True`` means every
    file parsed cleanly.
    """
    import py_compile
+    import tempfile

    root = Path(root)
-    for relpath in _UPDATE_CRITICAL_FILES:
-        path = root / relpath
-        if not path.exists():
-            # Missing file is suspicious but not necessarily fatal — a future
-            # refactor may legitimately remove one of these. Skip and move on.
-            continue
-        try:
-            py_compile.compile(str(path), doraise=True)
-        except py_compile.PyCompileError as exc:
-            return False, str(path), str(exc)
-        except OSError as exc:
-            return False, str(path), f"could not read: {exc}"
+    with tempfile.TemporaryDirectory(prefix="hermes-syntax-check-") as tmpdir:
+        for relpath in _UPDATE_CRITICAL_FILES:
+            path = root / relpath
+            if not path.exists():
+                # Missing file is suspicious but not necessarily fatal — a future
+                # refactor may legitimately remove one of these. Skip and move on.
+                continue
+            # Mirror the relative path under the tmpdir so two different
+            # files with the same basename don't collide on the cfile name.
+            cfile = Path(tmpdir) / (relpath.replace("/", "__") + "c")
+            try:
+                py_compile.compile(str(path), cfile=str(cfile), doraise=True)
+            except py_compile.PyCompileError as exc:
+                return False, str(path), str(exc)
+            except OSError as exc:
+                return False, str(path), f"could not read: {exc}"
    return True, None, None


--- a/hermes_cli/profiles.py
+++ b/hermes_cli/profiles.py
@ -902,7 +902,49 @@ def delete_profile(name: str, yes: bool = False) -> Path:

    # 4. Remove profile directory
    try:
-        shutil.rmtree(profile_dir)
+        def _make_writable(func, path, exc):
+            """onexc/onerror handler: add +w on PermissionError so rmtree can proceed.
+
+            Handles two cases on NixOS (and other systems with read-only
+            copies from immutable stores):
+            1. The path itself isn't writable (e.g. a file with mode 0444)
+            2. The *parent* directory isn't writable (e.g. mode 0555)
+
+            Compatible with both the ``onexc`` API (3.12+, receives an
+            exception instance) and the ``onerror`` API (3.11-, receives
+            ``sys.exc_info()`` tuple).
+            """
+            import stat as _stat
+            import sys as _sys
+
+            # Normalise the two callback signatures:
+            #   onexc(func, path, exc_instance)   — 3.12+
+            #   onerror(func, path, exc_info_tuple) — 3.11
+            if isinstance(exc, tuple):
+                exc = exc[1]  # exc_info → actual exception object
+
+            if isinstance(exc, PermissionError):
+                # Make the path writable
+                try:
+                    os.chmod(path, os.stat(path).st_mode | _stat.S_IWUSR)
+                except OSError:
+                    pass
+                # Also make the parent writable (needed for unlink/rmdir)
+                parent = os.path.dirname(path)
+                if parent:
+                    try:
+                        os.chmod(parent, os.stat(parent).st_mode | _stat.S_IWUSR)
+                    except OSError:
+                        pass
+                func(path)
+            else:
+                raise
+
+        # ``onexc`` was added in 3.12; fall back to ``onerror`` on 3.11.
+        try:
+            shutil.rmtree(profile_dir, onexc=_make_writable)
+        except TypeError:
+            shutil.rmtree(profile_dir, onerror=_make_writable)
        print(f"✓ Removed {profile_dir}")
    except Exception as e:
        print(f"⚠ Could not remove {profile_dir}: {e}")
--- a/pyproject.toml
+++ b/pyproject.toml
@ -84,7 +84,7 @@ modal = ["modal==1.3.4"]
 daytona = ["daytona==0.155.0"]
 vercel = ["vercel==0.5.7"]
 hindsight = ["hindsight-client==0.6.1"]
-dev = ["debugpy==1.8.20", "pytest==9.0.2", "pytest-asyncio==1.3.0", "pytest-xdist==3.8.0", "pytest-split==0.11.0", "pytest-timeout==2.4.0", "mcp==1.26.0", "ty==0.0.21", "ruff==0.15.10"]
+dev = ["debugpy==1.8.20", "pytest==9.0.2", "pytest-asyncio==1.3.0", "pytest-timeout==2.4.0", "mcp==1.26.0", "ty==0.0.21", "ruff==0.15.10"]
 messaging = ["python-telegram-bot[webhooks]==22.6", "discord.py[voice]==2.7.1", "aiohttp==3.13.3", "brotlicffi==1.2.0.1", "slack-bolt==1.27.0", "slack-sdk==3.40.1", "qrcode==7.4.2"]
 cron = []  # croniter is now a core dependency; this extra kept for back-compat
 slack = ["slack-bolt==1.27.0", "slack-sdk==3.40.1", "aiohttp==3.13.3"]
@ -232,16 +232,12 @@ markers = [
    "integration: marks tests requiring external services (API keys, Modal, etc.)",
    "real_concurrent_gate: opt out of the autouse stub that disables _detect_concurrent_hermes_instances",
 ]
-# pytest-timeout: per-test 60s hard cap with thread method.
-# Discovered May 2026: the suite reliably hangs at ~96% on full runs even
-# though every individual test completes in <30s. Root cause is leaked
-# threads / atexit handlers accumulating across thousands of tests until
-# something deadlocks at session teardown. Adding pytest-timeout (with
-# thread method, which forces an interrupt into the test thread) breaks
-# the deadlock — the suite then completes cleanly. The 60s cap is large
-# enough that no legitimate test trips it; if a test exceeds it that's a
-# real bug worth surfacing as a Timeout failure.
-addopts = "-m 'not integration' -n auto --timeout=30 --timeout-method=signal"
+# pytest-timeout: per-test 30s hard cap with signal method.
+# This is the fallback inside each per-file pytest subprocess (see
+# scripts/run_tests_parallel.py). Per-file isolation gives every test
+# file a fresh Python interpreter; pytest-timeout catches Python-level
+# hangs within a file.
+addopts = "-m 'not integration' --timeout=30 --timeout-method=signal"

 [tool.ty.environment]
 python-version = "3.13"
--- a/scripts/run_tests.sh
+++ b/scripts/run_tests.sh
@ -3,29 +3,36 @@
 # `pytest` directly to guarantee your local run matches CI behavior.
 #
 # What this script enforces:
-#   * -n 4 xdist workers (CI has 4 cores; -n auto diverges locally)
+#   * Per-file isolation via scripts/run_tests_parallel.py — each test
+#     file runs in its own freshly-spawned `python -m pytest <file>`
+#     subprocess. No xdist, no shared workers, no module-level leakage
+#     between files.
 #   * TZ=UTC, LANG=C.UTF-8, PYTHONHASHSEED=0 (deterministic)
-#   * Credential env vars blanked (conftest.py also does this, but this
-#     is belt-and-suspenders for anyone running `pytest` outside of
-#     our conftest path — e.g. calling pytest on a single file)
-#   * Proper venv activation
+#   * Env vars blanked (conftest.py also does this, but this
+#     is belt-and-suspenders for anyone running pytest outside our
+#     conftest path — e.g. on a single file)
+#   * Proper venv activation (probes .venv, venv, then ~/.hermes/...)
 #
 # Usage:
-#   scripts/run_tests.sh                     # full suite
-#   scripts/run_tests.sh tests/agent/        # one directory
-#   scripts/run_tests.sh tests/agent/test_foo.py::TestClass::test_method
-#   scripts/run_tests.sh --tb=long -v        # pass-through pytest args
+#   scripts/run_tests.sh                            # full suite
+#   scripts/run_tests.sh -j 4                       # cap parallelism
+#   scripts/run_tests.sh tests/agent/               # discover only here
+#   scripts/run_tests.sh tests/agent/ tests/acp/    # multiple roots
+#   scripts/run_tests.sh tests/foo.py               # single file
+#   scripts/run_tests.sh tests/foo.py -- --tb=long  # path + pytest args
+#   scripts/run_tests.sh -- -v --tb=long            # pytest args only
+#
+# Everything after a literal '--' is passed through to each per-file
+# pytest invocation. Positional path arguments before '--' override
+# the default discovery root (tests/).

 set -euo pipefail

 # ── Locate repo root ────────────────────────────────────────────────────────
-# Works whether this is the main checkout or a worktree.
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 REPO_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"

 # ── Activate venv ───────────────────────────────────────────────────────────
-# Prefer a .venv in the current tree, fall back to the main checkout's venv
-# (useful for worktrees where we don't always duplicate the venv).
 VENV=""
 for candidate in "$REPO_ROOT/.venv" "$REPO_ROOT/venv" "$HOME/.hermes/hermes-agent/venv"; do
  if [ -f "$candidate/bin/activate" ]; then
@ -41,94 +48,31 @@ fi

 PYTHON="$VENV/bin/python"

-# ── Ensure pytest-split is installed (required for shard-equivalent runs) ──
-if ! "$PYTHON" -c "import pytest_split" 2>/dev/null; then
-  echo "→ installing pytest-split into $VENV"
-  if command -v uv >/dev/null 2>&1; then
-    uv pip install --python "$PYTHON" --quiet "pytest-split>=0.9,<1"
-  elif "$PYTHON" -m pip --version >/dev/null 2>&1; then
-    "$PYTHON" -m pip install --quiet "pytest-split>=0.9,<1"
-  else
-    echo "error: neither uv nor pip is available in $VENV — pytest-split is missing" >&2
-    echo "  fix: run  uv pip install -e \".[dev]\"  from $REPO_ROOT" >&2
-    exit 1
-  fi
-fi

-# ── Hermetic environment ────────────────────────────────────────────────────
-# Mirror what CI does in .github/workflows/tests.yml + what conftest.py does.
-# Unset every credential-shaped var currently in the environment.
-while IFS='=' read -r name _; do
-  case "$name" in
-    *_API_KEY|*_TOKEN|*_SECRET|*_PASSWORD|*_CREDENTIALS|*_ACCESS_KEY| \
-    *_SECRET_ACCESS_KEY|*_PRIVATE_KEY|*_OAUTH_TOKEN|*_WEBHOOK_SECRET| \
-    *_ENCRYPT_KEY|*_APP_SECRET|*_CLIENT_SECRET|*_CORP_SECRET|*_AES_KEY| \
-    AWS_ACCESS_KEY_ID|AWS_SECRET_ACCESS_KEY|AWS_SESSION_TOKEN|FAL_KEY| \
-    GH_TOKEN|GITHUB_TOKEN)
-      unset "$name"
-      ;;
-  esac
-done < <(env)
-
-# Unset HERMES_* behavioral vars too.
-unset HERMES_YOLO_MODE HERMES_INTERACTIVE HERMES_QUIET HERMES_TOOL_PROGRESS \
-      HERMES_TOOL_PROGRESS_MODE HERMES_MAX_ITERATIONS HERMES_SESSION_PLATFORM \
-      HERMES_SESSION_CHAT_ID HERMES_SESSION_CHAT_NAME HERMES_SESSION_THREAD_ID \
-      HERMES_SESSION_SOURCE HERMES_SESSION_KEY HERMES_GATEWAY_SESSION \
-      HERMES_CRON_SESSION \
-      HERMES_PLATFORM HERMES_INFERENCE_PROVIDER HERMES_MANAGED HERMES_DEV \
-      HERMES_CONTAINER HERMES_EPHEMERAL_SYSTEM_PROMPT HERMES_TIMEZONE \
-      HERMES_REDACT_SECRETS HERMES_BACKGROUND_NOTIFICATIONS HERMES_EXEC_ASK \
-      HERMES_HOME_MODE 2>/dev/null || true
-
-# Pin deterministic runtime.
-export TZ=UTC
-export LANG=C.UTF-8
-export LC_ALL=C.UTF-8
-export PYTHONHASHSEED=0
-
-# ── Live-gateway test guard (developer machines) ────────────────────────────
-# If a system-wide hermes pytest_live_guard plugin is installed at
-# $HOME/.hermes/pytest_live_guard.py, force-load it here so every test run
-# from this script gets the protection regardless of which worktree is
-# checked out (in-tree tests/conftest.py guard may be missing on stale
-# branches). Harmless on CI / fresh machines that don't have the file.
+# ── Live-gateway plugin (computed before we drop env) ───────────────────────
+EXTRA_PYTHONPATH=""
+EXTRA_PYTEST_PLUGINS=""
 if [ -f "$HOME/.hermes/pytest_live_guard.py" ]; then
-  case ":${PYTHONPATH:-}:" in
-    *":$HOME/.hermes:"*) ;;
-    *) export PYTHONPATH="${PYTHONPATH:+$PYTHONPATH:}$HOME/.hermes" ;;
-  esac
-  if [[ ",${PYTEST_PLUGINS:-}," != *,pytest_live_guard,* ]]; then
-    export PYTEST_PLUGINS="${PYTEST_PLUGINS:+$PYTEST_PLUGINS,}pytest_live_guard"
-  fi
+  EXTRA_PYTHONPATH="$HOME/.hermes"
+  EXTRA_PYTEST_PLUGINS="pytest_live_guard"
 fi

-# ── Worker count ────────────────────────────────────────────────────────────
-# CI uses `-n auto` on ubuntu-latest which gives 4 workers. A 20-core
-# workstation with `-n auto` gets 20 workers and exposes test-ordering
-# flakes that CI will never see. Pin to 4 so local matches CI.
-WORKERS="${HERMES_TEST_WORKERS:-4}"

-# ── Run pytest ──────────────────────────────────────────────────────────────
+# ── Run in hermetic env ──────────────────────────────────────────────────────
+# env -i: start with empty environment, opt-in only what we need.
+# No credential var can leak — you'd have to explicitly add it here.
+echo "▶ running per-file parallel test suite via run_tests_parallel.py"
+echo "  (TZ=UTC LANG=C.UTF-8 PYTHONHASHSEED=0; clean env)"
+
 cd "$REPO_ROOT"

-# If the first argument starts with `-` treat all args as pytest flags;
-# otherwise treat them as test paths.
-ARGS=("$@")
-
-echo "▶ running pytest with $WORKERS workers, hermetic env, in $REPO_ROOT"
-echo "  (TZ=UTC LANG=C.UTF-8 PYTHONHASHSEED=0; all credential env vars unset)"
-
-# -o "addopts=" clears pyproject.toml's `-n auto` so our -n wins.
-# We re-add --timeout/--timeout-method here because pyproject.toml's
-# addopts is wiped above. The 60s cap is essential: see pyproject.toml
-# for why (suite deadlocks at session teardown without it).
-exec "$PYTHON" -m pytest \
-  -o "addopts=" \
-  -n "$WORKERS" \
-  --timeout=30 \
-  --timeout-method=signal \
-  --ignore=tests/integration \
-  --ignore=tests/e2e \
-  -m "not integration" \
-  "${ARGS[@]}"
+exec env -i \
+  PATH="$PATH" \
+  HOME="$HOME" \
+  TZ=UTC \
+  LANG=C.UTF-8 \
+  LC_ALL=C.UTF-8 \
+  PYTHONHASHSEED=0 \
+  ${EXTRA_PYTHONPATH:+PYTHONPATH="$EXTRA_PYTHONPATH"} \
+  ${EXTRA_PYTEST_PLUGINS:+PYTEST_PLUGINS="$EXTRA_PYTEST_PLUGINS"} \
+  "$PYTHON" "$SCRIPT_DIR/run_tests_parallel.py" "$@"
--- a/scripts/run_tests_parallel.py
+++ b/scripts/run_tests_parallel.py
@ -0,0 +1,650 @@
+#!/usr/bin/env python3
+"""Per-file parallel test runner.
+
+The minimum-viable replacement for pytest-xdist + a subprocess-isolation
+plugin. Discovers test files under ``tests/`` (excluding integration/e2e
+unless explicitly requested), then runs one ``python -m pytest <file>``
+subprocess per file, with bounded parallelism (default: ``os.cpu_count()``).
+
+Why per-file rather than per-test?
+    Per-test spawn overhead (~250ms × 17k tests = 70min CPU minimum)
+    swamped the actual work. Per-file spawn (~250ms × ~850 files = ~3.5min)
+    fits in the budget while still giving every file a fresh Python
+    interpreter — the only isolation boundary that actually matters
+    (cross-file module-level state leakage was the original flake source;
+    intra-file state is the test author's responsibility).
+
+Why drop xdist entirely?
+    xdist's persistent workers accumulate state across files, which is
+    exactly the leakage we wanted to fix. xdist also adds complexity
+    (loadfile vs loadscope, --max-worker-restart, internal control plane)
+    that we don't need when the unit of work is "run pytest on one file".
+    A subprocess.Popen pool gated by a semaphore is ~60 lines and does
+    the job.
+
+Usage:
+    python scripts/run_tests_parallel.py [pytest_args...]
+
+    Common pytest args pass through (e.g. ``-v``, ``-x``, ``--tb=long``,
+    ``-k 'pattern'``, ``--lf``).
+
+Environment:
+    HERMES_TEST_WORKERS  Override worker count (default: os.cpu_count())
+    HERMES_TEST_PATHS    Override discovery roots (colon-sep, default: 'tests')
+
+Exit code: 0 if every file's pytest exited 0; 1 otherwise.
+"""
+
+from __future__ import annotations
+
+import argparse
+import os
+import subprocess
+import sys
+import threading
+import time
+from concurrent.futures import ThreadPoolExecutor, Future
+from pathlib import Path
+from typing import Dict, List, Tuple
+
+
+# Default test discovery roots.
+_DEFAULT_ROOTS = ["tests"]
+
+# Directories to skip during discovery — the e2e + integration suites
+# require real services and are run separately. Match exactly the
+# ``--ignore=`` flags the previous CI command used.
+_SKIP_PARTS = {"integration", "e2e"}
+
+# Per-file wall-clock cap. Generous default — pytest-timeout still
+# enforces per-test caps inside each subprocess; this is just an outer
+# safety net so a single hung file can't stall the whole suite. Override
+# via --file-timeout or HERMES_TEST_FILE_TIMEOUT.
+_DEFAULT_FILE_TIMEOUT_SECONDS = 600.0  # 10 minutes
+
+
+def _count_tests(
+    files: List[Path], repo_root: Path, pytest_passthrough: List[str]
+) -> dict[Path, int]:
+    """Run ``pytest --co -q`` once to count individual tests per file.
+
+    Returns a mapping ``{file_path: test_count}``. Files with zero
+    collected tests are omitted from the dict (not an error — e.g. the
+    file only defines fixtures / conftest helpers).
+
+    This is a single subprocess call (~2-5s for ~1k files) that gives
+    us the total test count for the discovery announcement and
+    per-file counts for the progress lines.
+
+    ``--ignore`` flags for directories in ``_SKIP_PARTS`` are added
+    automatically so that pytest's own collection machinery (conftest
+    walking, directory traversal) doesn't pull in tests we intend to
+    skip — matching what the per-file runs will actually execute.
+    """
+    # Build --ignore flags for skipped dirs so the --co collection
+    # mirrors what we'll actually run (not what pytest might find via
+    # conftest walking or directory traversal).
+    ignore_args: List[str] = []
+    for root in [repo_root / p for p in _DEFAULT_ROOTS]:
+        for part in _SKIP_PARTS:
+            d = root / part
+            if d.is_dir():
+                ignore_args.extend(["--ignore", str(d)])
+
+    cmd = [
+        sys.executable, "-m", "pytest",
+        "--co", "-q",
+        *ignore_args,
+        *[str(f) for f in files],
+        *pytest_passthrough,
+    ]
+    try:
+        result = subprocess.run(
+            cmd,
+            cwd=repo_root,
+            capture_output=True,
+            text=True,
+            timeout=120,
+        )
+    except (subprocess.TimeoutExpired, OSError):
+        return {}
+
+    counts: dict[Path, int] = {}
+    for line in result.stdout.splitlines():
+        # Lines look like: tests/acp/test_auth.py::TestClass::test_name
+        if "::" not in line:
+            continue
+        file_part = line.split("::", 1)[0]
+        key = repo_root / file_part
+        counts[key] = counts.get(key, 0) + 1
+
+    return counts
+
+
+def _discover_files(roots: List[Path]) -> List[Path]:
+    """Return every ``test_*.py`` under the given roots (sorted).
+
+    Roots may be directories (recursed for ``test_*.py``) or explicit
+    ``.py`` files (included as-is, even if they don't match the
+    ``test_*`` prefix — caller knows what they want).
+
+    Exclude any file whose path contains a component in ``_SKIP_PARTS``,
+    UNLESS the user explicitly named it as a root (in which case the
+    user's intent overrides the skip filter).
+    """
+    seen: set[Path] = set()
+    out: List[Path] = []
+    for root in roots:
+        if not root.exists():
+            continue
+        if root.is_file():
+            # Explicit file: include it as-is, skip the _SKIP_PARTS filter
+            # since the user named it directly.
+            real = root.resolve()
+            if real not in seen:
+                seen.add(real)
+                out.append(root)
+            continue
+        for path in root.rglob("test_*.py"):
+            if any(part in _SKIP_PARTS for part in path.parts):
+                continue
+            real = path.resolve()
+            if real in seen:
+                continue
+            seen.add(real)
+            out.append(path)
+    return sorted(out)
+
+
+def _kill_tree(proc: "subprocess.Popen", pgid: int | None = None) -> None:
+    """Kill the pytest subprocess and every descendant it spawned.
+
+    A test run can spin up uvicorn servers, async runtimes, or other
+    long-running grandchildren that survive the pytest subprocess exit
+    if we don't kill the whole tree. ``subprocess.Popen.kill()`` only
+    targets the immediate child; grandchildren reparent to PID 1
+    (Linux) / get adopted by services.exe (Windows) and leak.
+
+    POSIX: the caller must pass ``pgid`` — the process group id captured
+    immediately after Popen (via ``os.getpgid(proc.pid)``). We can't
+    look it up here in the happy path because by the time we get
+    called the leader process has already been reaped and its pid is
+    gone from the kernel's process table, even though descendants in
+    the group are still alive. SIGKILL'ing the captured pgid takes out
+    everything in that group atomically.
+
+    Windows: ``taskkill /F /T /PID`` walks the recorded ppid chain and
+    terminates the whole tree, even when the root has already exited.
+
+    Why not psutil: psutil walks the parent-child tree, but in the
+    happy path the root has already been reaped so ``psutil.Process(pid)``
+    can't find it; grandchildren reparented to PID 1 are also
+    unreachable by tree walk at that point. The platform-native
+    primitives (process groups / taskkill) handle both cases correctly
+    without an extra abstraction layer.
+    """
+    if proc.pid is None:
+        return
+
+    if sys.platform == "win32":
+        try:
+            
+            subprocess.run(
+                ["taskkill", "/F", "/T", "/PID", str(proc.pid)],
+                stdout=subprocess.DEVNULL,
+                stderr=subprocess.DEVNULL,
+                timeout=10,
+            )  # windows-footgun: ok
+        except (subprocess.TimeoutExpired, FileNotFoundError, OSError):
+            pass
+    else:
+        # POSIX: kill the captured pgid. Local-import signal so the
+        # SIGKILL attribute is never referenced on Windows.
+        if pgid is not None:
+            try:
+                import signal as _signal
+                os.killpg(pgid, _signal.SIGKILL)  # windows-footgun: ok
+            except (ProcessLookupError, PermissionError, OSError):
+                pass
+
+    # Belt-and-suspenders: ensure subprocess.communicate() sees the exit.
+    try:
+        proc.kill()
+    except (ProcessLookupError, OSError):
+        pass
+
+
+def _run_one_file(
+    file: Path,
+    pytest_args: List[str],
+    repo_root: Path,
+    file_timeout: float,
+) -> Tuple[Path, int, str, dict[str, int]]:
+    """Run ``python -m pytest <file> <pytest_args>`` in a fresh subprocess.
+
+    Returns (file, returncode, captured_combined_output, summary_counts).
+
+    ``summary_counts`` is the result of ``_parse_pytest_summary(output)`` —
+
+    pytest exit codes (https://docs.pytest.org/en/stable/reference/exit-codes.html):
+        0 = all tests passed
+        1 = some tests failed
+        2 = test execution interrupted
+        3 = internal error
+        4 = pytest CLI usage error
+        5 = no tests collected
+
+    We treat exit 5 as a pass: it just means every test in the file was
+    skipped or filtered by a marker (e.g. ``-m 'not integration'`` skips
+    files where every test is marked integration). That's intentional and
+    not a failure mode.
+
+    On per-file timeout (``file_timeout`` seconds) or any other exception
+    during ``communicate()``, we kill the whole process group / process
+    tree so grandchildren (uvicorn servers, async runtimes, etc.) do not
+    orphan onto PID 1. The pytest-timeout plugin enforces per-test
+    timeouts inside the subprocess; this outer timeout exists only to
+    bound a pathologically slow or hung file as a whole.
+    """
+    cmd = [sys.executable, "-m", "pytest", str(file), *pytest_args]
+    proc = subprocess.Popen(
+        cmd,
+        cwd=repo_root,
+        stdout=subprocess.PIPE,
+        stderr=subprocess.STDOUT,
+        text=True,
+        # POSIX: place the child at the head of its own process group so
+        # _kill_tree can SIGKILL the group atomically.
+        # Windows: this maps to CREATE_NEW_PROCESS_GROUP in CPython 3.12+;
+        # _kill_tree handles the Windows path via taskkill /F /T.
+        start_new_session=True,
+    )
+
+    # Capture the pgid NOW, before the leader can exit and be reaped.
+    # Once the leader is reaped, os.getpgid(proc.pid) raises
+    # ProcessLookupError even though grandchildren in that group are
+    # still alive — defeating the whole cleanup. None on Windows where
+    # the pgid concept doesn't apply (taskkill walks ppid chain instead).
+    pgid: int | None = None
+    if sys.platform != "win32":
+        try:
+            pgid = os.getpgid(proc.pid)
+        except (ProcessLookupError, PermissionError):
+            # Astonishingly fast child? Already dead. _kill_tree's
+            # fallback will handle this case as a no-op.
+            pgid = None
+
+    try:
+        output, _ = proc.communicate(timeout=file_timeout)
+        rc = proc.returncode
+    except subprocess.TimeoutExpired:
+        _kill_tree(proc, pgid=pgid)
+        # Drain whatever the child wrote before we killed it so we have
+        # something to surface in the failure dump.
+        try:
+            output, _ = proc.communicate(timeout=10)
+        except subprocess.TimeoutExpired:
+            output = "(file timeout exceeded; output unavailable)"
+        rc = 124  # de facto convention for "killed by timeout".
+        output = (
+            f"(per-file timeout: {file_timeout:.0f}s exceeded; "
+            f"process tree SIGKILL'd)\n{output}"
+        )
+    except BaseException:
+        # KeyboardInterrupt / runner crash — make sure no zombie
+        # grandchildren outlive us.
+        _kill_tree(proc, pgid=pgid)
+        raise
+    else:
+        # Happy path: pytest exited on its own. The child process already
+        # cleaned up its grandchildren if it's well-behaved, but
+        # well-behaved is not universal — kill the group anyway. Already-
+        # dead processes are a no-op.
+        _kill_tree(proc, pgid=pgid)
+
+    if rc == 5:
+        # No tests collected — every test in the file was filtered out.
+        # Treat as a pass; surface info in a slightly distinct status
+        # so the operator can spot it.
+        rc = 0
+    summary = _parse_pytest_summary(output)
+    return file, rc, output, summary
+
+
+def _parse_pytest_summary(output: str) -> dict[str, int]:
+    """Extract per-file test pass/fail/skip counts from pytest output.
+
+    pytest prints a summary line like ``12 passed, 3 skipped, 1 failed in 2.1s``
+    as the last non-empty line before the short test summary.  We scrape that
+    line for the individual counts so the progress display can show test-level
+    granularity instead of just file-level pass/fail.
+
+    Returns a dict with keys ``passed``, ``failed``, ``skipped``, ``errors``,
+    ``xfailed``, ``xpassed`` (only keys found in the output are present).
+    """
+    import re
+
+    result: dict[str, int] = {}
+    # Walk backwards from the end — the summary line is always near the tail.
+    for line in reversed(output.splitlines()):
+        line = line.strip()
+        if not line:
+            continue
+        # Match "N passed", "N failed", "N skipped", "N errors", "N xfailed", "N xpassed"
+        for m in re.finditer(r"(\d+)\s+(passed|failed|skipped|errors|xfailed|xpassed)", line):
+            result[m.group(2)] = int(m.group(1))
+        # Also match "N error" (singular — pytest uses this sometimes).
+        for m in re.finditer(r"(\d+)\s+error\b", line):
+            result.setdefault("errors", result.get("errors", 0) + int(m.group(1)))
+        if result:
+            # Found the counts line — done.
+            break
+        # Stop at the short test summary header (if any) — everything above
+        # that is individual failure details, not the counts line.
+        if line.startswith("FAILED") or line.startswith("SHORT TEST SUMMARY"):
+            break
+    return result
+
+
+def _format_file(file: Path, repo_root: Path) -> str:
+    """Render a test-file path for display: strip the repo-root prefix
+    when possible so output reads ``tests/acp/test_auth.py`` instead of
+    ``/home/runner/work/hermes-agent/hermes-agent/tests/acp/test_auth.py``.
+
+    Falls back to the absolute path for anything outside the repo root.
+    """
+    try:
+        return str(file.resolve().relative_to(repo_root.resolve()))
+    except ValueError:
+        return str(file)
+
+
+def _print_progress(
+    tests_done: int,
+    total_tests: int,
+    file: Path,
+    rc: int,
+    dur: float,
+    repo_root: Path,
+    tests_passed: int,
+    tests_failed: int,
+    test_counts: dict[Path, int],
+    file_summary: dict[str, int] | None = None,
+) -> None:
+    """Single-line live progress.
+
+    When ``file_summary`` is provided (parsed from pytest output), the
+    per-file parenthetical shows individual test pass/fail counts instead
+    of just the total test count.
+    """
+    status = "✓" if rc == 0 else "✗"
+    pct = (tests_done / total_tests * 100) if total_tests else 0
+    # Digit width for left-side counter padding (derived from total file count).
+    fw = len(str(tests_passed + tests_failed))
+    # Build per-file test count string.
+    if file_summary:
+        parts = []
+        p = file_summary.get("passed", 0)
+        f = file_summary.get("failed", 0)
+        s = file_summary.get("skipped", 0)
+        e = file_summary.get("errors", 0)
+        if p:
+            parts.append(f"{p}✓")
+        if f:
+            parts.append(f"{f}✗")
+        if s:
+            parts.append(f"{s}s")
+        if e:
+            parts.append(f"{e}e")
+        # xfailed/xpassed are rare; include if present.
+        xf = file_summary.get("xfailed", 0)
+        xp = file_summary.get("xpassed", 0)
+        if xf:
+            parts.append(f"{xf}xf")
+        if xp:
+            parts.append(f"{xp}xp")
+        test_str = " ".join(parts) + ", " if parts else ""
+    else:
+        n_tests = test_counts.get(file, 0)
+        test_str = f"{n_tests} tests, " if n_tests else ""
+    msg = (
+        f"[{pct:5.1f}% | {tests_done:>5}/{total_tests}"
+        f" | ✓{tests_passed:>{fw}} | ✗{tests_failed:>{fw}}] "
+        f"{status} {_format_file(file, repo_root)} ({test_str}{dur:.1f}s)"
+    )
+    # Truncate to terminal width if available (no clobbering ANSI lines).
+    try:
+        cols = os.get_terminal_size().columns
+        if len(msg) > cols:
+            msg = msg[: cols - 1] + "…"
+    except OSError:
+        pass
+    print(msg, flush=True)
+
+
+def _print_inline_failure(
+    file: Path, output: str, repo_root: Path, pytest_passthrough: List[str]
+) -> None:
+    """Print a compact failure summary immediately when a file fails.
+
+    Shows the tail of the pytest output (the failure section with stack
+    traces) and a ready-to-run repro command, so the developer doesn't
+    have to wait for the full run to finish before seeing what broke.
+    """
+    rel = _format_file(file, repo_root)
+    # Build a repro command the developer can copy-paste.
+    passthrough_str = " ".join(pytest_passthrough) if pytest_passthrough else ""
+    repro = f"python -m pytest {rel}"
+    if passthrough_str:
+        repro += f" {passthrough_str}"
+
+    # Grab just the failure lines (last ~30 lines of pytest output —
+    # typically the FAILED summary + short test info).
+    lines = output.rstrip().splitlines()
+    tail = "\n".join(lines[-30:])
+
+    print(flush=True)
+    print(f"  ╔╍ Failed: {rel} ╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍", flush=True)
+    for line in tail.splitlines():
+        print(f"  ║ {line}", flush=True)
+    print(f"  ║", flush=True)
+    print(f"  ║  Repro: {repro}", flush=True)
+    print(f"  ╚╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍", flush=True)
+    print(flush=True)
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser(
+        description=__doc__,
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+    )
+    parser.add_argument(
+        "-j",
+        "--jobs",
+        type=int,
+        default=int(os.environ.get("HERMES_TEST_WORKERS") or (os.cpu_count() or 4) * 2),
+        help="Parallel worker count (default: $HERMES_TEST_WORKERS or cpu_count*2)",
+    )
+    parser.add_argument(
+        "--paths",
+        default=os.environ.get("HERMES_TEST_PATHS", ":".join(_DEFAULT_ROOTS)),
+        help="Colon-separated discovery roots (default: 'tests')",
+    )
+    parser.add_argument(
+        "--include-integration",
+        action="store_true",
+        help="Don't skip integration/ e2e/ during discovery",
+    )
+    parser.add_argument(
+        "--file-timeout",
+        type=float,
+        default=float(
+            os.environ.get("HERMES_TEST_FILE_TIMEOUT", _DEFAULT_FILE_TIMEOUT_SECONDS)
+        ),
+        help=(
+            "Per-file wall-clock cap in seconds. On timeout, the pytest "
+            "subprocess and its full process tree are SIGKILL'd. "
+            "Default: 600 (10 min), env: HERMES_TEST_FILE_TIMEOUT."
+        ),
+    )
+    parser.add_argument(
+        "paths_positional",
+        nargs="*",
+        metavar="PATH",
+        help=(
+            "Restrict discovery to these paths (directories or .py files). "
+            "Mutually exclusive with --paths. Anything after a literal '--' "
+            "separator is passed through to each per-file pytest invocation."
+        ),
+    )
+    # Manually split argv on '--' so positional paths and pytest passthrough
+    # args don't fight over each other. argparse's nargs="*" positional is
+    # greedy and will swallow everything after '--' including the pytest
+    # flags, defeating the convention.
+    argv = sys.argv[1:]
+    if "--" in argv:
+        sep = argv.index("--")
+        our_args, pytest_passthrough = argv[:sep], argv[sep + 1 :]
+    else:
+        our_args, pytest_passthrough = argv, []
+    args = parser.parse_args(our_args)
+
+    repo_root = Path(__file__).resolve().parent.parent
+
+    # Resolve discovery roots: positional path args override --paths if any
+    # were supplied, otherwise --paths (which itself defaults to 'tests').
+    if args.paths_positional:
+        # Positionals can be directories OR explicit .py files. Either is
+        # fine — _discover_files handles both via rglob('test_*.py') for
+        # dirs and direct inclusion for files.
+        roots = [repo_root / p for p in args.paths_positional]
+    else:
+        roots = [repo_root / p for p in args.paths.split(":") if p]
+
+    if args.include_integration:
+        # Caller takes responsibility — typically used via explicit -k filter.
+        global _SKIP_PARTS  # noqa: PLW0603 — config knob
+        _SKIP_PARTS = set()
+
+    files = _discover_files(roots)
+    if not files:
+        print(f"No test files discovered under {[str(r) for r in roots]}", file=sys.stderr)
+        return 1
+
+    # Count individual tests per file via a single pytest --co pass.
+    test_counts = _count_tests(files, repo_root, pytest_passthrough)
+    total_tests = sum(test_counts.values())
+
+    print(
+        f"Discovered {len(files)} test files ({total_tests} tests) under "
+        f"{[str(r.relative_to(repo_root)) if r.is_relative_to(repo_root) else str(r) for r in roots]}; "
+        f"running with -j {args.jobs}",
+        flush=True,
+    )
+
+    # Capture and print on completion (out-of-order is fine — keeps the
+    # terminal clean rather than interleaving N parallel pytest outputs).
+    failures: List[Tuple[Path, str, Dict[str, int]]] = []
+    started = time.monotonic()
+    files_done = 0
+    tests_done = 0
+    pass_count = 0
+    fail_count = 0
+    tests_passed = 0
+    tests_failed = 0
+    lock = threading.Lock()
+
+    def _on_done(file: Path, started_at: float, fut: "Future[Tuple[Path, int, str, dict[str, int]]]") -> None:
+        nonlocal files_done, tests_done, pass_count, fail_count, tests_passed, tests_failed
+        n_tests = test_counts.get(file, 0)
+        try:
+            fpath, rc, output, summary = fut.result()
+        except Exception as exc:  # noqa: BLE001 — must always advance counter
+            with lock:
+                files_done += 1
+                tests_done += n_tests
+                fail_count += 1
+                failures.append((file, f"runner crashed: {exc!r}", {}))
+                _print_progress(
+                    tests_done, total_tests, file, 1,
+                    time.monotonic() - started_at,
+                    repo_root, tests_passed, tests_failed,
+                    test_counts,
+                )
+            return
+        with lock:
+            files_done += 1
+            tests_done += n_tests
+            # Accumulate test-level counts from parsed summary.
+            tests_passed += summary.get("passed", 0)
+            tests_failed += summary.get("failed", 0)
+            if rc == 0:
+                pass_count += 1
+            else:
+                fail_count += 1
+                failures.append((fpath, output, summary))
+            _print_progress(
+                tests_done, total_tests, fpath, rc,
+                time.monotonic() - started_at,
+                repo_root, tests_passed, tests_failed,
+                test_counts,
+                file_summary=summary,
+            )
+            if rc != 0:
+                _print_inline_failure(fpath, output, repo_root, pytest_passthrough)
+
+    with ThreadPoolExecutor(max_workers=args.jobs) as pool:
+        futures: List[Future] = []
+        for file in files:
+            t0 = time.monotonic()
+            fut = pool.submit(
+                _run_one_file, file, pytest_passthrough, repo_root, args.file_timeout
+            )
+            fut.add_done_callback(lambda f, file=file, t0=t0: _on_done(file, t0, f))
+            futures.append(fut)
+        # Block until everything's done. ThreadPoolExecutor.__exit__ waits
+        # for all submitted work, but doing it explicitly here makes the
+        # control flow obvious.
+        for fut in futures:
+            fut.result() if fut.exception() is None else None
+
+    elapsed = time.monotonic() - started
+    print()
+    pct = (tests_done / total_tests * 100) if total_tests else 0
+    print(f"=== Summary: {len(files)} files, {tests_passed} tests passed, {tests_failed} failed ({pct:.0f}% complete) in {elapsed:.1f}s ({args.jobs} workers) ===")
+
+    if failures:
+        print()
+        print("=== Failure output ===")
+        for file, output, _summary in failures:
+            print()
+            print(f"--- {_format_file(file, repo_root)} ---")
+            print(output.rstrip())
+        print()
+        # Split: files with actual test failures vs non-zero exit for other reasons
+        test_fail_files = [(f, s) for f, _o, s in failures if s.get("failed", 0) > 0]
+        all_passed_but_nonzero = [(f, s) for f, _o, s in failures
+                                  if s.get("failed", 0) == 0 and s.get("passed", 0) > 0]
+        no_tests_ran = [(f, s) for f, _o, s in failures
+                        if s.get("failed", 0) == 0 and s.get("passed", 0) == 0]
+        if test_fail_files:
+            total_tf = sum(s.get("failed", 0) for _, s in test_fail_files)
+            print(f"=== {len(test_fail_files)} file{'s' if len(test_fail_files) != 1 else ''} with test failures ({total_tf} test{'s' if total_tf != 1 else ''} failed) ===")
+            for file, s in test_fail_files:
+                nf = s.get("failed", 0)
+                print(f"  {_format_file(file, repo_root)}  ({nf} test{'s' if nf != 1 else ''} failed)")
+        if all_passed_but_nonzero:
+            print(f"=== {len(all_passed_but_nonzero)} file{'s' if len(all_passed_but_nonzero) != 1 else ''} where all tests passed but pytest exited non-zero (warnings-as-errors, hook failures, etc.) ===")
+            for file, s in all_passed_but_nonzero:
+                print(f"  {_format_file(file, repo_root)}  ({s.get('passed', 0)} passed)")
+        if no_tests_ran:
+            print(f"=== {len(no_tests_ran)} file{'s' if len(no_tests_ran) != 1 else ''} where no tests ran (collection/import error, timeout before collection, etc.) ===")
+            for file, s in no_tests_ran:
+                print(f"  {_format_file(file, repo_root)}")
+        return 1
+
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
--- a/tests/agent/test_auxiliary_client.py
+++ b/tests/agent/test_auxiliary_client.py
@ -40,6 +40,16 @@ def _clean_env(monkeypatch):
        "ANTHROPIC_API_KEY", "ANTHROPIC_TOKEN", "CLAUDE_CODE_OAUTH_TOKEN",
    ):
        monkeypatch.delenv(key, raising=False)
+    # Module-level unhealthy cache (10-min TTL) leaks between tests;
+    # earlier tests that call _mark_provider_unhealthy() poison the
+    # cache for later ones, causing _resolve_auto to skip providers
+    # that the test patched to return valid clients.
+    import agent.auxiliary_client as _aux_mod
+    _aux_mod._aux_unhealthy_until.clear()
+    _aux_mod._aux_unhealthy_logged_at.clear()
+    yield
+    _aux_mod._aux_unhealthy_until.clear()
+    _aux_mod._aux_unhealthy_logged_at.clear()


@pytest.fixture
@ -461,6 +471,17 @@ class TestExpiredCodexFallback:
        import base64
        import time as _time

+        # Belt-and-suspenders: _try_openrouter marks openrouter unhealthy
+        # when OPENROUTER_API_KEY is absent (which the preceding test in
+        # this class exercises).  The file-level _clean_env autouse fixture
+        # clears the cache, but fixture ordering with the conftest
+        # _hermetic_environment autouse can leave a narrow window where
+        # the mark reappears.  Explicitly clear here so this test is
+        # independent of run order.
+        import agent.auxiliary_client as _aux_mod
+        _aux_mod._aux_unhealthy_until.clear()
+        _aux_mod._aux_unhealthy_logged_at.clear()
+
        header = base64.urlsafe_b64encode(b'{"alg":"RS256","typ":"JWT"}').rstrip(b"=").decode()
        payload_data = json.dumps({"exp": int(_time.time()) - 3600}).encode()
        payload = base64.urlsafe_b64encode(payload_data).rstrip(b"=").decode()
@ -1047,6 +1068,20 @@ class TestGetProviderChain:
 class TestTryPaymentFallback:
    """_try_payment_fallback skips the failed provider and tries alternatives."""

+    @pytest.fixture(autouse=True)
+    def _clear_unhealthy_cache(self):
+        """Earlier tests in this file call _mark_provider_unhealthy() which
+        pollutes the module-level ``_aux_unhealthy_until`` dict (10-min TTL).
+        Without this cleanup the fallback chain skips providers we've patched
+        to return valid clients — the patched function is never called.
+        """
+        from agent.auxiliary_client import _aux_unhealthy_until, _aux_unhealthy_logged_at
+        _aux_unhealthy_until.clear()
+        _aux_unhealthy_logged_at.clear()
+        yield
+        _aux_unhealthy_until.clear()
+        _aux_unhealthy_logged_at.clear()
+
    def test_skips_failed_provider(self):
        mock_client = MagicMock()
        with patch("agent.auxiliary_client._try_openrouter", return_value=(None, None)), \
--- a/tests/agent/test_skill_commands.py
+++ b/tests/agent/test_skill_commands.py
@ -556,10 +556,11 @@ Generate some audio.
            raising=False,
        )

-        with patch.dict(
-            os.environ, {"HERMES_SESSION_PLATFORM": "telegram"}, clear=False
-        ):
-            with patch("tools.skills_tool.SKILLS_DIR", tmp_path):
+        with patch("tools.skills_tool.SKILLS_DIR", tmp_path):
+            from gateway.session_context import clear_session_vars, set_session_vars
+
+            tokens = set_session_vars(platform="telegram")
+            try:
                _make_skill(
                    tmp_path,
                    "test-skill",
@ -571,6 +572,8 @@ Generate some audio.
                )
                scan_skill_commands()
                msg = build_skill_invocation_message("/test-skill", "do stuff")
+            finally:
+                clear_session_vars(tokens)

        assert msg is not None
        assert "local cli" in msg.lower()
--- a/tests/conftest.py
+++ b/tests/conftest.py
@ -20,12 +20,9 @@ test runner at ``scripts/run_tests.sh``.
 """

 import asyncio
-import logging
 import os
 import re
-import signal
 import sys
-import tempfile
 from pathlib import Path
 from unittest.mock import patch

@ -37,6 +34,22 @@ if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))


+# ── Per-file process isolation ──────────────────────────────────────────────
+# Tests run via ``scripts/run_tests_parallel.py``, which spawns a fresh
+# ``python -m pytest <file>`` subprocess per test file. Cross-file state
+# leakage (module-level dicts, ContextVars, caches) is impossible: each
+# file gets a clean Python interpreter. Intra-file ordering is the test
+# author's responsibility — if test A in foo.py mutates state that test B
+# in foo.py reads, that's a real bug to fix in the file (it would also
+# bite anyone running ``pytest tests/foo.py`` directly).
+#
+# This replaces the historic _reset_module_state autouse fixture (manual
+# state clearing) and the brief experiment with subprocess-per-test
+# isolation (too slow at ~17k tests).
+#
+# See ``scripts/run_tests_parallel.py`` for the runner.
+
+
 # ── Credential env-var filter ──────────────────────────────────────────────
 #
 # Any env var in the current process matching ONE of these patterns is
@ -279,7 +292,7 @@ _HERMES_BEHAVIORAL_VARS = frozenset({
    "WECOM_HOME_CHANNEL_NAME",
    # Platform gating — set by load_gateway_config() as a side effect when
    # a config.yaml is present, so individual test bodies that call the
-    # loader leak these values into later tests on the same xdist worker.
+    # loader leak these values into later tests in the same process.
    # Force-clear on every test setup so the leak can't happen.
    "SLACK_REQUIRE_MENTION",
    "SLACK_STRICT_MENTION",
@ -368,144 +381,21 @@ def _isolate_hermes_home(_hermetic_environment):
    return None


-# ── Module-level state reset ───────────────────────────────────────────────
+# ── Module-level state reset — replaced by per-file process isolation ──────
 #
-# Python modules are singletons per process, and pytest-xdist workers are
-# long-lived. Module-level dicts/sets (tool registries, approval state,
-# interrupt flags) and ContextVars persist across tests in the same worker,
-# causing tests that pass alone to fail when run with siblings.
+# Each test FILE runs in a freshly-spawned ``python -m pytest <file>``
+# subprocess via ``scripts/run_tests_parallel.py``, so module-level dicts /
+# sets / ContextVars from tests in one file cannot leak into tests in
+# another file. No manual per-module clearing needed.
 #
-# Each entry in this fixture clears state that belongs to a specific module.
-# New state buckets go here too — this is the single gate that prevents
-# "works alone, flakes in CI" bugs from state leakage.
+# Within a single file, ordering is the author's responsibility. If your
+# tests in the same file share mutable state, either reset it explicitly
+# in a fixture or split them across files.
 #
-# The skill `test-suite-cascade-diagnosis` documents the concrete patterns
-# this closes; the running example was `test_command_guards` failing 12/15
-# CI runs because ``tools.approval._session_approved`` carried approvals
-# from one test's session into another's.
-
-@pytest.fixture(autouse=True)
-def _reset_module_state():
-    """Clear module-level mutable state and ContextVars between tests.
-
-    Keeps state from leaking across tests on the same xdist worker. Modules
-    that don't exist yet (test collection before production import) are
-    skipped silently — production import later creates fresh empty state.
-    """
-    # --- logging — quiet/one-shot paths mutate process-global logger state ---
-    logging.disable(logging.NOTSET)
-    for _logger_name in ("tools", "run_agent", "trajectory_compressor", "cron", "hermes_cli"):
-        _logger = logging.getLogger(_logger_name)
-        _logger.disabled = False
-        _logger.setLevel(logging.NOTSET)
-        _logger.propagate = True
-
-    # --- tools.approval — the single biggest source of cross-test pollution ---
-    try:
-        from tools import approval as _approval_mod
-        _approval_mod._session_approved.clear()
-        _approval_mod._session_yolo.clear()
-        _approval_mod._permanent_approved.clear()
-        _approval_mod._pending.clear()
-        _approval_mod._gateway_queues.clear()
-        _approval_mod._gateway_notify_cbs.clear()
-        # ContextVar: reset to empty string so get_current_session_key()
-        # falls through to the env var / default path, matching a fresh
-        # process.
-        _approval_mod._approval_session_key.set("")
-    except Exception:
-        pass
-
-    # --- tools.interrupt — per-thread interrupt flag set ---
-    try:
-        from tools import interrupt as _interrupt_mod
-        with _interrupt_mod._lock:
-            _interrupt_mod._interrupted_threads.clear()
-    except Exception:
-        pass
-
-    # --- gateway.session_context — 9 ContextVars that represent
-    #     the active gateway session. If set in one test and not reset,
-    #     the next test's get_session_env() reads stale values.
-    try:
-        from gateway import session_context as _sc_mod
-        for _cv in (
-            _sc_mod._SESSION_PLATFORM,
-            _sc_mod._SESSION_CHAT_ID,
-            _sc_mod._SESSION_CHAT_NAME,
-            _sc_mod._SESSION_THREAD_ID,
-            _sc_mod._SESSION_USER_ID,
-            _sc_mod._SESSION_USER_NAME,
-            _sc_mod._SESSION_KEY,
-            _sc_mod._CRON_AUTO_DELIVER_PLATFORM,
-            _sc_mod._CRON_AUTO_DELIVER_CHAT_ID,
-            _sc_mod._CRON_AUTO_DELIVER_THREAD_ID,
-        ):
-            _cv.set(_sc_mod._UNSET)
-    except Exception:
-        pass
-
-    # --- tools.env_passthrough — ContextVar<set[str]> with no default ---
-    # LookupError is normal if the test never set it. Setting it to an
-    # empty set unconditionally normalizes the starting state.
-    try:
-        from tools import env_passthrough as _envp_mod
-        _envp_mod._allowed_env_vars_var.set(set())
-    except Exception:
-        pass
-
-    # --- tools.terminal_tool — active environment/cwd cache ---
-    # File tools prefer a live terminal cwd when one is cached for the task.
-    # Clear terminal environments between tests so a prior terminal call can't
-    # override TERMINAL_CWD in path-resolution tests.
-    try:
-        from tools import terminal_tool as _term_mod
-        _envs_to_cleanup = []
-        with _term_mod._env_lock:
-            _envs_to_cleanup = list(_term_mod._active_environments.values())
-            _term_mod._active_environments.clear()
-            _term_mod._last_activity.clear()
-            _term_mod._creation_locks.clear()
-        for _env in _envs_to_cleanup:
-            try:
-                _env.cleanup()
-            except Exception:
-                pass
-    except Exception:
-        pass
-
-    # --- tools.credential_files — ContextVar<dict> ---
-    try:
-        from tools import credential_files as _credf_mod
-        _credf_mod._registered_files_var.set({})
-    except Exception:
-        pass
-
-    # --- agent.auxiliary_client — runtime main provider/model override and
-    #     payment-error health cache. Both are process-global in production;
-    #     reset them per test so one worker's fallback/402 test does not make
-    #     later auxiliary-client tests skip otherwise-available providers.
-    try:
-        from agent import auxiliary_client as _aux_mod
-        _aux_mod.clear_runtime_main()
-        _aux_mod._reset_aux_unhealthy_cache()
-    except Exception:
-        pass
-
-    # --- tools.file_tools — per-task read history + file-ops cache ---
-    # _read_tracker accumulates per-task_id read history for loop detection,
-    # capped by _READ_HISTORY_CAP. If entries from a prior test persist, the
-    # cap is hit faster than expected and capacity-related tests flake.
-    try:
-        from tools import file_tools as _ft_mod
-        with _ft_mod._read_tracker_lock:
-            _ft_mod._read_tracker.clear()
-        with _ft_mod._file_ops_lock:
-            _ft_mod._file_ops_cache.clear()
-    except Exception:
-        pass
-
-    yield
+# The skill ``test-suite-cascade-diagnosis`` documents the cascade patterns
+# this replaces; the running example was ``test_command_guards`` failing
+# 12/15 CI runs because ``tools.approval._session_approved`` carried
+# approvals from one test's session into another's.


@pytest.fixture()
@ -532,13 +422,12 @@ def mock_config():
    }


-# ── Global test timeout ─────────────────────────────────────────────────────
-# Kill any individual test that takes longer than 30 seconds.
-# Prevents hanging tests (subprocess spawns, blocking I/O) from stalling the
-# entire test suite.
+# ── Per-test timeout — handled by the isolation plugin ─────────────────────
+#
+# The subprocess-per-test plugin enforces the configured ``isolate_timeout``
+# ini key by terminating the child if it overruns. The old SIGALRM-based
+# fixture (POSIX-only, didn't work on Windows) is gone.

-def _timeout_handler(signum, frame):
-    raise TimeoutError("Test exceeded 30 second timeout")

@pytest.fixture(autouse=True)
 def _ensure_current_event_loop(request):
@ -584,45 +473,6 @@ def _ensure_current_event_loop(request):
                asyncio.set_event_loop(None)


-@pytest.fixture(autouse=True)
-def _enforce_test_timeout():
-    """Kill any individual test that takes longer than 30 seconds.
-    SIGALRM is Unix-only; skip on Windows."""
-    if sys.platform == "win32":
-        yield
-        return
-    old = signal.signal(signal.SIGALRM, _timeout_handler)
-    signal.alarm(30)
-    yield
-    signal.alarm(0)
-    signal.signal(signal.SIGALRM, old)
-
-
-@pytest.fixture(autouse=True)
-def _reset_tool_registry_caches():
-    """Clear tool-registry-level caches between tests.
-
-    The production registry caches ``check_fn()`` results for 30 s
-    (see tools/registry.py) and :func:`get_tool_definitions` memoizes
-    its result (see model_tools.py). Both are keyed on state that tests
-    routinely mutate (env vars, registry._generation, config.yaml mtime)
-    — but a stale result from test A can still be served to test B
-    because 30 s covers the entire suite, and xdist worker reuse means
-    one test's cache lands in another's process. Clearing before every
-    test keeps hermetic behavior.
-    """
-    try:
-        from tools.registry import invalidate_check_fn_cache
-        invalidate_check_fn_cache()
-    except ImportError:
-        pass
-    try:
-        from model_tools import _clear_tool_defs_cache
-        _clear_tool_defs_cache()
-    except ImportError:
-        pass
-
-
 # ── Live-system guard ──────────────────────────────────────────────────────
 #
 # Several test files exercise the gateway-restart / kill code paths
--- a/tests/gateway/conftest.py
+++ b/tests/gateway/conftest.py
@ -313,19 +313,30 @@ def _scan_for_plugin_adapter_antipattern(source: str) -> list[str]:
    return offenses


-def pytest_configure(config):
-    """Reject plugin-adapter tests that use the sys.path anti-pattern.
+def _fingerprint_gateway_tests() -> str:
+    """Return a short fingerprint that changes when any gateway test file changes.

-    Runs once per pytest session on the controller, BEFORE any xdist
-    worker is spawned. If any file under ``tests/gateway/`` matches the
-    anti-pattern, we fail the whole session with a clear message —
-    before a polluted ``sys.path`` can cascade across workers.
+    Uses (mtime, size) pairs instead of content hashing — fast to compute
+    (stat-only, no reads) and sufficient for cache invalidation across
+    per-file subprocess runs.
    """
-    # Only run on the xdist controller (or in non-xdist runs). Skip on
-    # worker subprocesses so we don't scan the filesystem N times.
-    if hasattr(config, "workerinput"):
-        return
+    import hashlib

+    h = hashlib.sha256()
+    for path in sorted(_GATEWAY_DIR.rglob("test_*.py")):
+        try:
+            st = path.stat()
+            h.update(f"{path.name}:{st.st_mtime_ns}:{st.st_size}".encode())
+        except OSError:
+            h.update(f"{path.name}:missing".encode())
+    return h.hexdigest()[:16]
+
+
+def _run_adapter_antipattern_scan() -> list[str]:
+    """Scan gateway test files for the plugin-adapter anti-pattern.
+
+    Returns a list of violation strings (empty if clean).
+    """
    violations: list[str] = []
    for path in _GATEWAY_DIR.rglob("test_*.py"):
        if path.name in {"_plugin_adapter_loader.py", "conftest.py"}:
@ -334,20 +345,108 @@ def pytest_configure(config):
            source = path.read_text(encoding="utf-8")
        except OSError:
            continue
+        # Fast string pre-filter: skip files that can't possibly violate.
+        # A violating file MUST contain both (a) an adapter/plugins/platforms
+        # reference AND (b) either sys.path manipulation or a bare adapter import.
        if "adapter" not in source and "plugins/platforms" not in source:
            continue
+        if not (
+            "sys.path" in source
+            or "import adapter" in source
+            or "from adapter import" in source
+        ):
+            continue
        offenses = _scan_for_plugin_adapter_antipattern(source)
        if offenses:
            violations.append(
                f"  {path.relative_to(_GATEWAY_DIR.parent.parent)}:\n    "
                + "\n    ".join(offenses)
            )
+    return violations

-    if violations:
-        raise pytest.UsageError(
-            "Plugin-adapter-import anti-pattern detected in gateway tests:\n"
-            + "\n".join(violations)
-            + "\n\n"
-            + _GUARD_HINT
-        )
+
+def pytest_configure(config):
+    """Reject plugin-adapter tests that use the sys.path anti-pattern.
+
+    Runs once per pytest session on the controller, BEFORE any xdist
+    worker is spawned. If any file under ``tests/gateway/`` matches the
+    anti-pattern, we fail the whole session with a clear message —
+    before a polluted ``sys.path`` can cascade across workers.
+
+    **Performance**: in the per-file subprocess isolation model (no xdist),
+    every subprocess is a "controller" — so the naive scan would run 257
+    times, each costing ~1s of AST walking.  We avoid this with two
+    strategies:
+
+    1. **Tight string pre-filter**: a file can only violate if it contains
+       *both* an adapter/plugins/platforms reference *and* a sys.path
+       manipulation or bare ``import adapter``.  This drops ~95% of files
+       from needing AST parsing.
+    2. **File-locked cache**: the scan result is cached in
+       ``.pytest-cache/gw-adapter-guard-<fingerprint>`` keyed on a
+       fingerprint of the gateway test file mtimes/sizes.  Concurrent
+       subprocesses acquire a lock; only the first performs the scan;
+       the rest wait and read the cached result.
+    """
+    # Only run on the xdist controller (or in non-xdist runs). Skip on
+    # worker subprocesses so we don't scan the filesystem N times.
+    if hasattr(config, "workerinput"):
+        return
+
+    fp = _fingerprint_gateway_tests()
+    cache_dir = Path.cwd() / ".pytest-cache"
+    cache_file = cache_dir / f"gw-adapter-guard-{fp}"
+    lock_file = cache_dir / f".gw-adapter-guard-{fp}.lock"
+
+    cache_dir.mkdir(parents=True, exist_ok=True)
+
+    # Evict stale cache entries from previous fingerprints (best-effort).
+    try:
+        for old in cache_dir.glob("gw-adapter-guard-*"):
+            if old.name != f"gw-adapter-guard-{fp}":
+                old.unlink(missing_ok=True)
+        for old in cache_dir.glob(".gw-adapter-guard-*.lock"):
+            if old.name != f".gw-adapter-guard-{fp}.lock":
+                old.unlink(missing_ok=True)
+    except OSError:
+        pass  # Non-critical; old files are harmless.
+
+    # Use filelock to ensure only one process scans at a time.
+    # Concurrent subprocesses all hit pytest_configure simultaneously;
+    # without a lock they'd all find no cache and all run the scan.
+    try:
+        from filelock import FileLock
+        lock = FileLock(str(lock_file), timeout=120)
+    except ImportError:
+        # Fallback: no locking (still correct, just slower under contention).
+        import contextlib
+
+        class _NoLock:
+            def __enter__(self):
+                return self
+            def __exit__(self, *a):
+                pass
+        lock = _NoLock()
+
+    with lock:
+        if cache_file.exists():
+            cached = cache_file.read_text(encoding="utf-8")
+            if cached == "clean":
+                return
+            raise pytest.UsageError(cached)
+
+        # Slow path: this process is the first to acquire the lock.
+        violations = _run_adapter_antipattern_scan()
+
+        if violations:
+            msg = (
+                "Plugin-adapter-import anti-pattern detected in gateway tests:\n"
+                + "\n".join(violations)
+                + "\n\n"
+                + _GUARD_HINT
+            )
+            cache_file.write_text(msg, encoding="utf-8")
+            raise pytest.UsageError(msg)
+        else:
+            cache_file.write_text("clean", encoding="utf-8")

--- a/tests/gateway/test_dm_topics.py
+++ b/tests/gateway/test_dm_topics.py
@ -22,19 +22,26 @@ from gateway.config import PlatformConfig


 def _ensure_telegram_mock():
-    if "telegram" in sys.modules and hasattr(sys.modules["telegram"], "__file__"):
-        return
-
    telegram_mod = MagicMock()
    telegram_mod.ext.ContextTypes.DEFAULT_TYPE = type(None)
-    telegram_mod.constants.ParseMode.MARKDOWN_V2 = "MarkdownV2"
-    telegram_mod.constants.ChatType.GROUP = "group"
-    telegram_mod.constants.ChatType.SUPERGROUP = "supergroup"
-    telegram_mod.constants.ChatType.CHANNEL = "channel"
-    telegram_mod.constants.ChatType.PRIVATE = "private"

-    for name in ("telegram", "telegram.ext", "telegram.constants", "telegram.request"):
-        sys.modules.setdefault(name, telegram_mod)
+    # Register telegram.constants as a separate module mock so that
+    # ``from telegram.constants import ChatType`` resolves to our mock
+    # with string-valued members (not auto-generated MagicMocks).
+    constants_mod = MagicMock()
+    constants_mod.ParseMode.MARKDOWN_V2 = "MarkdownV2"
+    constants_mod.ChatType.GROUP = "group"
+    constants_mod.ChatType.SUPERGROUP = "supergroup"
+    constants_mod.ChatType.CHANNEL = "channel"
+    constants_mod.ChatType.PRIVATE = "private"
+
+    sys.modules["telegram"] = telegram_mod
+    sys.modules["telegram.ext"] = telegram_mod.ext
+    sys.modules["telegram.constants"] = constants_mod
+    sys.modules["telegram.request"] = telegram_mod.request
+
+    # Force reimport so the adapter picks up the mock ChatType.
+    sys.modules.pop("gateway.platforms.telegram", None)


 _ensure_telegram_mock()
--- a/tests/gateway/test_google_chat.py
+++ b/tests/gateway/test_google_chat.py
@ -22,6 +22,11 @@ import pytest

 from gateway.config import Platform, PlatformConfig, load_gateway_config

+# Platform uses _missing_() for dynamic members, so "google_chat" is
+# resolvable via Platform("google_chat") even without a static
+# GOOGLE_CHAT attribute on the enum class.
+_GC = Platform("google_chat")
+

 # ---------------------------------------------------------------------------
 # Mock the google-* packages if they are not installed
@ -229,7 +234,7 @@ def _make_chat_envelope(text="hello", sender_email="u@example.com", sender_type=

 class TestPlatformRegistration:
    def test_enum_value(self):
-        assert Platform.GOOGLE_CHAT.value == "google_chat"
+        assert _GC.value == "google_chat"

    def test_requirements_check_returns_true_when_available(self):
        # The shim flag is True in this test module.
@ -266,14 +271,14 @@ class TestEnvConfigLoading:
        monkeypatch.setenv("GOOGLE_CHAT_PROJECT_ID", "p")
        # No subscription.
        cfg = load_gateway_config()
-        assert Platform.GOOGLE_CHAT not in cfg.platforms
+        assert _GC not in cfg.platforms

    def test_missing_project_does_not_enable(self, monkeypatch):
        self._clean_env(monkeypatch)
        monkeypatch.setenv("GOOGLE_CHAT_SUBSCRIPTION_NAME",
                           "projects/p/subscriptions/s")
        cfg = load_gateway_config()
-        assert Platform.GOOGLE_CHAT not in cfg.platforms
+        assert _GC not in cfg.platforms



@ -2583,7 +2588,7 @@ class TestAuthorizationEmailMatch:
        runner.pairing_store.is_approved = MagicMock(return_value=False)

        source = SessionSource(
-            platform=Platform.GOOGLE_CHAT,
+            platform=_GC,
            chat_id="spaces/S",
            chat_type="dm",
            user_id="alice@example.com",       # post-swap: email is canonical
@ -2604,7 +2609,7 @@ class TestAuthorizationEmailMatch:
        runner.pairing_store.is_approved = MagicMock(return_value=False)

        source = SessionSource(
-            platform=Platform.GOOGLE_CHAT,
+            platform=_GC,
            chat_id="spaces/S",
            chat_type="dm",
            user_id="bob@example.com",
@ -2630,7 +2635,7 @@ class TestAuthorizationEmailMatch:
        runner.pairing_store.is_approved = MagicMock(return_value=False)

        source = SessionSource(
-            platform=Platform.GOOGLE_CHAT,
+            platform=_GC,
            chat_id="spaces/S",
            chat_type="dm",
            user_id="users/77777",  # no email available — resource name wins
--- a/tests/hermes_cli/test_pty_bridge.py
+++ b/tests/hermes_cli/test_pty_bridge.py
@ -7,6 +7,7 @@ printf) to verify it behaves like a PTY you can read/write/resize/close.
 from __future__ import annotations

 import os
+import shutil
 import sys
 import time

@ -66,7 +67,7 @@ class TestPtyBridgeIO:
    def test_write_sends_to_child_stdin(self):
        # `cat` with no args echoes stdin back to stdout.  We write a line,
        # read it back, then signal EOF to let cat exit cleanly.
-        bridge = PtyBridge.spawn(["/bin/cat"])
+        bridge = PtyBridge.spawn([shutil.which("cat") or "cat"])
        try:
            bridge.write(b"hello-pty\n")
            output = _read_until(bridge, b"hello-pty")
--- a/tests/plugins/test_achievements_plugin.py
+++ b/tests/plugins/test_achievements_plugin.py
@ -62,8 +62,9 @@ def plugin_api(tmp_path, monkeypatch):
 class _FakeSessionDB:
    """Stand-in for hermes_state.SessionDB that records scan calls."""

-    def __init__(self, session_count: int):
+    def __init__(self, session_count: int, scan_delay: float = 0):
        self.session_count = session_count
+        self.scan_delay = scan_delay
        self.last_limit: Optional[int] = None
        self.last_include_children: Optional[bool] = None
        self.list_calls = 0
@ -78,6 +79,8 @@ class _FakeSessionDB:
        include_children: bool = False,
        project_compression_tips: bool = True,
    ) -> List[Dict[str, Any]]:
+        if self.scan_delay:
+            time.sleep(self.scan_delay)
        self.last_limit = limit
        self.last_include_children = include_children
        self.list_calls += 1
@ -225,10 +228,8 @@ def test_evaluate_all_stale_cache_serves_stale_and_refreshes_in_background(plugi
    the stale data immediately and kicks a background refresh. Users don't
    stare at a loading spinner every time TTL expires.
    """
-    fake_db = _FakeSessionDB(session_count=10)
+    fake_db = _FakeSessionDB(session_count=10, scan_delay=2.0)
    _install_fake_session_db(plugin_api, fake_db)
-
-    # Seed a stale snapshot on disk.
    stale_generated_at = int(time.time()) - plugin_api.SNAPSHOT_TTL_SECONDS - 60
    stale_payload = {
        "achievements": [],
--- a/tests/plugins/web/test_web_search_provider_plugins.py
+++ b/tests/plugins/web/test_web_search_provider_plugins.py
@ -2,8 +2,8 @@

 Covers:

- All seven bundled plugins (brave-free, ddgs, searxng, exa, parallel,
-  tavily, firecrawl) instantiate and self-report the expected
+- All eight bundled plugins (brave-free, ddgs, searxng, exa, parallel,
+  tavily, firecrawl, xai) instantiate and self-report the expected
  capabilities + ABC-derived defaults.
 - Each plugin's ``is_available()`` correctly reflects env-var presence.
 - The web_search_registry resolves an active provider in the documented
@ -47,6 +47,7 @@ def _clear_web_env(monkeypatch: pytest.MonkeyPatch) -> None:
        "FIRECRAWL_GATEWAY_URL",
        "TOOL_GATEWAY_DOMAIN",
        "TOOL_GATEWAY_USER_TOKEN",
+        "XAI_API_KEY",
    ):
        monkeypatch.delenv(k, raising=False)

@ -70,7 +71,7 @@ def _isolate_env(monkeypatch: pytest.MonkeyPatch) -> None:


 class TestBundledPluginsRegister:
-    """All seven bundled web plugins discover and register correctly."""
+    """All eight bundled web plugins discover and register correctly."""

    def test_all_seven_plugins_present_in_registry(self) -> None:
        _ensure_plugins_loaded()
@ -85,6 +86,7 @@ class TestBundledPluginsRegister:
            "parallel",
            "searxng",
            "tavily",
+            "xai",
        ]

    @pytest.mark.parametrize(
@ -100,6 +102,8 @@ class TestBundledPluginsRegister:
            # disabled in the migration (fell through to a legacy inline
            # path); the follow-up commit enabled it natively.
            ("firecrawl", True, True, True),
+            # xai: search-only via Grok's agentic web_search tool.
+            ("xai", True, False, False),
        ],
    )
    def test_capability_flags_match_spec(
@ -120,7 +124,7 @@ class TestBundledPluginsRegister:

    @pytest.mark.parametrize(
        "plugin_name",
-        ["brave-free", "ddgs", "searxng", "exa", "parallel", "tavily", "firecrawl"],
+        ["brave-free", "ddgs", "searxng", "exa", "parallel", "tavily", "firecrawl", "xai"],
    )
    def test_each_plugin_has_name_and_display_name(self, plugin_name: str) -> None:
        _ensure_plugins_loaded()
@ -133,7 +137,7 @@ class TestBundledPluginsRegister:

    @pytest.mark.parametrize(
        "plugin_name",
-        ["brave-free", "ddgs", "searxng", "exa", "parallel", "tavily", "firecrawl"],
+        ["brave-free", "ddgs", "searxng", "exa", "parallel", "tavily", "firecrawl", "xai"],
    )
    def test_each_plugin_has_setup_schema(self, plugin_name: str) -> None:
        """``get_setup_schema()`` returns a dict the picker can consume."""
@ -239,6 +243,17 @@ class TestIsAvailable:
        # Truthy or falsy, just must not raise.
        _ = bool(p.is_available())

+    def test_xai_requires_api_key_or_oauth(self, monkeypatch: pytest.MonkeyPatch) -> None:
+        """xAI needs XAI_API_KEY or OAuth tokens in auth.json."""
+        _ensure_plugins_loaded()
+        from agent.web_search_registry import get_provider
+
+        p = get_provider("xai")
+        assert p is not None
+        assert p.is_available() is False  # no XAI_API_KEY, no auth.json
+        monkeypatch.setenv("XAI_API_KEY", "real")
+        assert p.is_available() is True
+

 # ---------------------------------------------------------------------------
 # Registry resolution semantics (Option B — conservative smart fallback)
@ -455,7 +470,7 @@ class TestErrorResponseShapes:
        if result["results"]:
            assert "error" in result["results"][0]

-    def test_firecrawl_crawl_returns_error_dict_when_unconfigured(self) -> None:
+    def test_firecrawl_crawl_returns_error_dict_when_unconfigured(self):
        """firecrawl crawl is async (wraps SDK in to_thread); error must be
        surfaced via the per-page result shape, not raised."""
        _ensure_plugins_loaded()
@ -473,3 +488,15 @@ class TestErrorResponseShapes:
        assert len(result["results"]) >= 1
        assert "error" in result["results"][0]
        assert result["results"][0]["url"] == "https://example.com"
+
+    def test_xai_search_returns_error_dict_when_unconfigured(self) -> None:
+        """xAI returns a typed error dict (no XAI_API_KEY)."""
+        _ensure_plugins_loaded()
+        from agent.web_search_registry import get_provider
+
+        p = get_provider("xai")
+        assert p is not None
+        result = p.search("test", limit=5)
+        assert isinstance(result, dict)
+        assert result.get("success") is False
+        assert "error" in result
--- a/tests/test_run_tests_parallel.py
+++ b/tests/test_run_tests_parallel.py
@ -0,0 +1,187 @@
+"""Verify scripts/run_tests_parallel.py kills test-spawned grandchildren.
+
+Setup
+-----
+A test in this file spawns a long-lived Python grandchild that writes
+its PID + a nonce to a tempfile, then exits without cleaning up.
+With the old ``subprocess.run`` runner, that grandchild would orphan
+and outlive the test (and the whole runner). With the current Popen +
+``start_new_session`` + ``_kill_tree`` runner, the grandchild gets
+SIGKILL'd via process-group kill when its file's pytest exits.
+
+The leaker test always passes — its only job is to spawn a grandchild
+and walk away. The verifier runs the runner over the leaker file in a
+subprocess, then waits for the grandchild PID to disappear from the
+kernel's process table.
+
+POSIX-only: Windows has its own grandchild lifecycle (no shared session,
+``taskkill /F /T`` semantics). Marked accordingly.
+"""
+
+from __future__ import annotations
+
+import json
+import os
+import subprocess
+import sys
+import textwrap
+import time
+from pathlib import Path
+
+import pytest
+
+
+# Both tests share the same handoff file: the leaker writes here, the
+# verifier reads here. We park it in $TMPDIR with a unique-per-run name
+# so concurrent invocations of the suite don't clobber each other.
+_HANDOFF_DIR = Path(os.environ.get("TMPDIR", "/tmp")) / "hermes-isolation-probe"
+_HANDOFF_DIR.mkdir(exist_ok=True)
+
+
+def _handoff_path_for(nonce: str) -> Path:
+    return _HANDOFF_DIR / f"grandchild-{nonce}.json"
+
+
+def _pid_alive(pid: int) -> bool:
+    """POSIX: send signal 0 to probe whether ``pid`` is still alive.
+
+    ``os.kill(pid, 0)`` raises ``ProcessLookupError`` if the process is
+    gone, ``PermissionError`` if it exists but we can't signal it
+    (someone else's pid). We treat PermissionError as "alive" because
+    the process exists and that's all we need to know.
+    """
+    if sys.platform == "win32":  # pragma: no cover — POSIX-only test
+        # On Windows we'd use OpenProcess + GetExitCodeProcess; this
+        # test is skipped on Windows so the path is unreachable.
+        raise RuntimeError("_pid_alive POSIX-only")
+    try:
+        os.kill(pid, 0)
+    except ProcessLookupError:
+        return False
+    except PermissionError:
+        return True
+    return True
+
+
+@pytest.mark.skipif(sys.platform == "win32", reason="POSIX-only probe")
+@pytest.mark.live_system_guard_bypass
+def test_grandchild_leak_is_killed_by_runner(tmp_path: Path) -> None:
+    """Run the parallel runner over a probe file and verify cleanup.
+
+    1. Materialize a probe file that spawns a long-lived grandchild and
+       writes its PID to disk before exiting.
+    2. Invoke ``scripts/run_tests_parallel.py`` against the probe file.
+    3. Wait for the grandchild PID to vanish (poll for ~5s).
+    4. Assert the runner exited cleanly AND the grandchild is dead.
+    """
+    repo_root = Path(__file__).resolve().parent.parent
+    runner = repo_root / "scripts" / "run_tests_parallel.py"
+    assert runner.exists(), f"runner missing at {runner}"
+
+    # Probe lives in a temp dir, NOT under tests/, so the regular suite
+    # never picks it up — only our explicit invocation does.
+    probe_dir = tmp_path / "probe"
+    probe_dir.mkdir()
+    probe = probe_dir / "test_probe_leaker.py"
+    nonce = f"{os.getpid()}-{int(time.time() * 1000)}"
+    handoff = _handoff_path_for(nonce)
+    if handoff.exists():
+        handoff.unlink()
+
+    probe_src = textwrap.dedent(f"""
+        import json, os, subprocess, sys, time
+        from pathlib import Path
+
+        HANDOFF = Path({str(handoff)!r})
+
+        def test_spawns_grandchild_and_walks_away():
+            # Long-lived grandchild: detached, ignores SIGTERM (we want
+            # SIGKILL or process-group kill to be the only thing that
+            # works, simulating a misbehaving server).
+            child = subprocess.Popen(
+                [
+                    sys.executable, "-c",
+                    "import os, signal, sys, time; "
+                    "signal.signal(signal.SIGTERM, signal.SIG_IGN); "
+                    "sys.stdout.write(f'gc-pgid={{os.getpgid(0)}} gc-pid={{os.getpid()}}\\\\n'); "
+                    "sys.stdout.flush(); "
+                    "time.sleep(600)",
+                ],
+                stdout=subprocess.PIPE,
+                stderr=subprocess.STDOUT,
+                # IMPORTANT: do NOT pass start_new_session here. We want
+                # the grandchild to inherit the pytest subprocess's
+                # process group, so when the runner kills the group the
+                # grandchild dies too.
+            )
+            # Read the first line so we can record gc's pgid in the
+            # handoff, then walk away — don't close the pipe (would
+            # signal EOF and let the child see SIGPIPE on next write).
+            first_line = child.stdout.readline().decode().strip()
+            HANDOFF.write_text(json.dumps({{
+                "pid": child.pid,
+                "diag": first_line,
+                "test_pid": os.getpid(),
+                "test_pgid": os.getpgid(0),
+            }}))
+            assert child.pid > 0
+    """).strip()
+    probe.write_text(probe_src + "\n")
+
+    # Run the parallel runner against just the probe file. The runner
+    # discovers under ``tests/`` by default, so we override via --paths.
+    proc = subprocess.run(
+        [
+            sys.executable,
+            str(runner),
+            "--paths",
+            str(probe_dir),
+            "-j",
+            "1",
+            # Tight per-file timeout: the probe finishes in <1s, no
+            # need for 10min.
+            "--file-timeout",
+            "30",
+        ],
+        cwd=repo_root,
+        stdout=subprocess.PIPE,
+        stderr=subprocess.STDOUT,
+        text=True,
+        timeout=60,
+    )
+
+    assert handoff.exists(), (
+        f"probe never wrote handoff file; runner output:\n{proc.stdout}"
+    )
+    handoff_data = json.loads(handoff.read_text())
+    grandchild_pid = handoff_data["pid"]
+    diag = handoff_data.get("diag", "(no diag)")
+    test_pid = handoff_data.get("test_pid")
+    test_pgid = handoff_data.get("test_pgid")
+    handoff.unlink()
+
+    # The runner must have exited cleanly (probe test passes).
+    assert proc.returncode == 0, (
+        f"runner exited {proc.returncode}; output:\n{proc.stdout}"
+    )
+
+    # The grandchild must be gone. Poll for a bit because process-group
+    # SIGKILL + reaping isn't synchronous; on a loaded box it can take
+    # a beat.
+    deadline = time.monotonic() + 5.0
+    while time.monotonic() < deadline:
+        if not _pid_alive(grandchild_pid):
+            break
+        time.sleep(0.05)
+    else:
+        # Test cleanup: kill the leaked grandchild ourselves so a
+        # FAILED assertion doesn't leave a sleep(600) running.
+        try:
+            os.kill(grandchild_pid, 9)
+        except ProcessLookupError:
+            pass
+        pytest.fail(
+            f"grandchild PID {grandchild_pid} survived runner exit; "
+            f"diag={diag!r} test_pid={test_pid} test_pgid={test_pgid}; "
+            f"runner output:\n{proc.stdout}"
+        )
--- a/tests/tools/conftest.py
+++ b/tests/tools/conftest.py
@ -0,0 +1,50 @@
+"""Shared fixtures for tests/tools/ web-provider tests.
+
+Per-file subprocess isolation means each test file gets a fresh interpreter,
+so module-level state (like the web-search-provider registry) is empty when
+a file starts.  The ``web_registry_populated`` fixture registers all bundled
+providers before each test and resets the registry afterwards — tests that
+depend on the registry being populated should use it explicitly or via
+``@pytest.mark.usefixtures("web_registry_populated")``.
+"""
+
+import pytest
+
+
+def register_all_web_providers():
+    """Register all bundled web-search providers into the global registry.
+
+    This is the single source of truth for the provider list used by
+    test classes that need the registry populated for dispatch checks.
+    """
+    from agent.web_search_registry import register_provider, _reset_for_tests
+    from plugins.web.brave_free.provider import BraveFreeWebSearchProvider
+    from plugins.web.ddgs.provider import DDGSWebSearchProvider
+    from plugins.web.exa.provider import ExaWebSearchProvider
+    from plugins.web.firecrawl.provider import FirecrawlWebSearchProvider
+    from plugins.web.parallel.provider import ParallelWebSearchProvider
+    from plugins.web.searxng.provider import SearXNGWebSearchProvider
+    from plugins.web.tavily.provider import TavilyWebSearchProvider
+    from plugins.web.xai.provider import XAIWebSearchProvider
+
+    _reset_for_tests()
+    for cls in (
+        BraveFreeWebSearchProvider,
+        DDGSWebSearchProvider,
+        ExaWebSearchProvider,
+        FirecrawlWebSearchProvider,
+        ParallelWebSearchProvider,
+        SearXNGWebSearchProvider,
+        TavilyWebSearchProvider,
+        XAIWebSearchProvider,
+    ):
+        register_provider(cls())
+
+
+@pytest.fixture
+def web_registry_populated():
+    """Populate the web-search-provider registry for one test, then reset."""
+    register_all_web_providers()
+    yield
+    from agent.web_search_registry import _reset_for_tests
+    _reset_for_tests()
--- a/tests/tools/test_approval_plugin_hooks.py
+++ b/tests/tools/test_approval_plugin_hooks.py
@ -22,18 +22,28 @@ from tools.approval import (


@pytest.fixture
-def isolated_session(monkeypatch):
-    """Give each test a fresh session_key and clean approval-state."""
+def isolated_session(monkeypatch, tmp_path):
+    """Give each test a fresh session_key, clean approval-state, and isolated
+    HERMES_HOME so the real user's command_allowlist doesn't leak in."""
+    import tools.approval as _am
+
    session_key = "test:session:approval_hooks"
    token = set_current_session_key(session_key)
    monkeypatch.setenv("HERMES_SESSION_KEY", session_key)
    # Make sure we don't skip guards via yolo / approvals.mode=off
    monkeypatch.delenv("HERMES_YOLO_MODE", raising=False)
+    # Isolate from the real user's permanent allowlist + session state
+    _saved_permanent = _am._permanent_approved.copy()
+    _saved_session = {k: v.copy() for k, v in _am._session_approved.items()}
+    _am._permanent_approved.clear()
+    _am._session_approved.clear()
    try:
        yield session_key
    finally:
+        _am._permanent_approved.update(_saved_permanent)
+        _am._session_approved.update(_saved_session)
        try:
-            approval_module._approval_session_key.reset(token)
+            _am._approval_session_key.reset(token)
        except Exception:
            pass
        clear_session(session_key)
--- a/tests/tools/test_browser_supervisor.py
+++ b/tests/tools/test_browser_supervisor.py
@ -41,7 +41,7 @@ def _find_chrome() -> str:


@pytest.fixture
-def chrome_cdp(worker_id):
+def chrome_cdp(request):
    """Start a headless Chrome with --remote-debugging-port, yield its WS URL.

    Uses a unique port per xdist worker to avoid cross-worker collisions.
@ -51,6 +51,9 @@ def chrome_cdp(worker_id):
    import socket

    # xdist worker_id is "master" in single-process mode or "gw0".."gwN" otherwise.
+    # Under subprocess-per-file isolation there's no xdist, so we fall back
+    # to "master" via the session-scoped fixture below.
+    worker_id = request.getfixturevalue("worker_id") if "worker_id" in request.fixturenames else "master"
    if worker_id == "master":
        port_offset = 0
    else:
--- a/tests/tools/test_discord_tool.py
+++ b/tests/tools/test_discord_tool.py
@ -1089,9 +1089,17 @@ class Test403Enrichment:
 class TestModelToolsIntegration:
    def setup_method(self):
        _reset_capability_cache()
+        from model_tools import _clear_tool_defs_cache
+        from tools.registry import invalidate_check_fn_cache
+        _clear_tool_defs_cache()
+        invalidate_check_fn_cache()

    def teardown_method(self):
        _reset_capability_cache()
+        from model_tools import _clear_tool_defs_cache
+        from tools.registry import invalidate_check_fn_cache
+        _clear_tool_defs_cache()
+        invalidate_check_fn_cache()

    @patch("tools.discord_tool._discord_request")
    def test_discord_admin_schema_rebuilt_by_get_tool_definitions(
--- a/tests/tools/test_homeassistant_tool.py
+++ b/tests/tools/test_homeassistant_tool.py
@ -501,16 +501,18 @@ class TestRegistration:

    def test_check_fn_gates_availability(self, monkeypatch):
        """Registry should exclude HA tools when HASS_TOKEN is not set."""
-        from tools.registry import registry
+        from tools.registry import invalidate_check_fn_cache, registry

        monkeypatch.delenv("HASS_TOKEN", raising=False)
+        invalidate_check_fn_cache()
        defs = registry.get_definitions({"ha_list_entities", "ha_get_state", "ha_call_service"})
        assert len(defs) == 0

    def test_check_fn_includes_when_token_set(self, monkeypatch):
        """Registry should include HA tools when HASS_TOKEN is set."""
-        from tools.registry import registry
+        from tools.registry import invalidate_check_fn_cache, registry

        monkeypatch.setenv("HASS_TOKEN", "test-token")
+        invalidate_check_fn_cache()
        defs = registry.get_definitions({"ha_list_entities", "ha_get_state", "ha_call_service"})
        assert len(defs) == 3
--- a/tests/tools/test_kanban_tools.py
+++ b/tests/tools/test_kanban_tools.py
@ -1093,6 +1093,11 @@ def test_kanban_guidance_not_in_normal_prompt(monkeypatch, tmp_path):
    from pathlib import Path as _P
    monkeypatch.setattr(_P, "home", lambda: tmp_path)

+    from tools.registry import invalidate_check_fn_cache
+    from model_tools import _clear_tool_defs_cache
+    invalidate_check_fn_cache()
+    _clear_tool_defs_cache()
+
    from run_agent import AIAgent
    a = AIAgent(
        api_key="test",
@ -1116,6 +1121,11 @@ def test_kanban_guidance_in_worker_prompt(monkeypatch, tmp_path):
    from pathlib import Path as _P
    monkeypatch.setattr(_P, "home", lambda: tmp_path)

+    from tools.registry import invalidate_check_fn_cache
+    from model_tools import _clear_tool_defs_cache
+    invalidate_check_fn_cache()
+    _clear_tool_defs_cache()
+
    from run_agent import AIAgent
    a = AIAgent(
        api_key="test",
--- a/tests/tools/test_send_message_tool.py
+++ b/tests/tools/test_send_message_tool.py
@ -10,6 +10,12 @@ from unittest.mock import AsyncMock, MagicMock, patch

 import pytest

+# python-telegram-bot is an optional dep — skip the entire module when
+# it isn't installed (e.g. CI bare env). Tests that patch telegram.Bot
+# or call _send_telegram need it; tests for other platforms don't but
+# keeping the whole file consistent is simpler.
+_HAS_TELEGRAM = pytest.importorskip("telegram", reason="python-telegram-bot not installed") is not None
+

@pytest.fixture(autouse=True)
 def _reset_signal_scheduler():
--- a/tests/tools/test_terminal_tool_requirements.py
+++ b/tests/tools/test_terminal_tool_requirements.py
@ -2,11 +2,26 @@

 import importlib

+import pytest
+
 from model_tools import get_tool_definitions

 terminal_tool_module = importlib.import_module("tools.terminal_tool")


+@pytest.fixture(autouse=True)
+def _clear_caches():
+    """Invalidate check_fn and tool-definitions caches before each test
+    so that monkeypatched env vars / config take effect."""
+    from tools.registry import invalidate_check_fn_cache
+    from model_tools import _clear_tool_defs_cache
+    invalidate_check_fn_cache()
+    _clear_tool_defs_cache()
+    yield
+    invalidate_check_fn_cache()
+    _clear_tool_defs_cache()
+
+
 class TestTerminalRequirements:
    def test_local_backend_requirements(self, monkeypatch):
        monkeypatch.setattr(
--- a/tests/tools/test_video_generation_tool_surface_matrix.py
+++ b/tests/tools/test_video_generation_tool_surface_matrix.py
@ -95,7 +95,9 @@ def _invoke_tool(home, cfg: dict, args: dict) -> dict:
    if hasattr(cfg_mod, "_invalidate_load_config_cache"):
        cfg_mod._invalidate_load_config_cache()

-    from tools.registry import registry
+    from tools.registry import discover_builtin_tools, registry
+    if "video_generate" not in registry._tools:
+        discover_builtin_tools()
    handler = registry._tools["video_generate"].handler
    return json.loads(handler(args))

--- a/tests/tools/test_web_providers.py
+++ b/tests/tools/test_web_providers.py
@ -13,6 +13,8 @@ from typing import Any, Dict, List

 import pytest

+from tests.tools.conftest import register_all_web_providers
+

 # ---------------------------------------------------------------------------
 # ABC enforcement
@ -276,6 +278,15 @@ class TestUnconfiguredErrorEnvelopeParity:
    ``result.get("error")`` detect the failure cleanly.
    """

+    _register_providers = staticmethod(register_all_web_providers)
+
+    @pytest.fixture(autouse=True)
+    def _populate_web_registry(self):
+        self._register_providers()
+        yield
+        from agent.web_search_registry import _reset_for_tests
+        _reset_for_tests()
+
    def _clear_web_creds(self, monkeypatch):
        for k in (
            "BRAVE_SEARCH_API_KEY",
--- a/tests/tools/test_web_providers_brave_free.py
+++ b/tests/tools/test_web_providers_brave_free.py
@ -15,6 +15,10 @@ from __future__ import annotations
 import json
 from unittest.mock import MagicMock, patch

+import pytest
+
+from tests.tools.conftest import register_all_web_providers
+

 # ---------------------------------------------------------------------------
 # BraveFreeWebSearchProvider unit tests
@ -239,6 +243,15 @@ class TestBraveFreeBackendWiring:


 class TestBraveFreeSearchOnlyErrors:
+    _register_providers = staticmethod(register_all_web_providers)
+
+    @pytest.fixture(autouse=True)
+    def _populate_web_registry(self):
+        self._register_providers()
+        yield
+        from agent.web_search_registry import _reset_for_tests
+        _reset_for_tests()
+
    def test_web_extract_returns_search_only_error(self, monkeypatch):
        import asyncio
        from tools import web_tools
@ -246,6 +259,7 @@ class TestBraveFreeSearchOnlyErrors:
        monkeypatch.setattr(web_tools, "_load_web_config", lambda: {"backend": "brave-free"})
        monkeypatch.setenv("BRAVE_SEARCH_API_KEY", "BSAkey123")
        monkeypatch.setattr(web_tools, "_is_tool_gateway_ready", lambda: False)
+        monkeypatch.setattr(web_tools, "is_safe_url", lambda url: True)
        monkeypatch.setattr("tools.interrupt.is_interrupted", lambda: False, raising=False)

        result_str = asyncio.get_event_loop().run_until_complete(
@ -264,6 +278,8 @@ class TestBraveFreeSearchOnlyErrors:
        monkeypatch.setenv("BRAVE_SEARCH_API_KEY", "BSAkey123")
        monkeypatch.setattr(web_tools, "_is_tool_gateway_ready", lambda: False)
        monkeypatch.setattr(web_tools, "check_firecrawl_api_key", lambda: False)
+        monkeypatch.setattr(web_tools, "is_safe_url", lambda url: True)
+        monkeypatch.setattr(web_tools, "check_website_access", lambda url: None)
        monkeypatch.setattr("tools.interrupt.is_interrupted", lambda: False, raising=False)

        result_str = asyncio.get_event_loop().run_until_complete(
--- a/tests/tools/test_web_providers_ddgs.py
+++ b/tests/tools/test_web_providers_ddgs.py
@ -14,6 +14,10 @@ import sys
 import types
 from unittest.mock import MagicMock

+import pytest
+
+from tests.tools.conftest import register_all_web_providers
+

 def _install_fake_ddgs(monkeypatch, *, text_results=None, text_raises=None):
    """Install a stub ``ddgs`` module in sys.modules for the duration of a test.
@ -210,6 +214,15 @@ class TestDDGSBackendWiring:


 class TestDDGSSearchOnlyErrors:
+    _register_providers = staticmethod(register_all_web_providers)
+
+    @pytest.fixture(autouse=True)
+    def _populate_web_registry(self):
+        self._register_providers()
+        yield
+        from agent.web_search_registry import _reset_for_tests
+        _reset_for_tests()
+
    def test_web_extract_returns_search_only_error(self, monkeypatch):
        import asyncio
        from tools import web_tools
@ -217,6 +230,7 @@ class TestDDGSSearchOnlyErrors:
        monkeypatch.setattr(web_tools, "_load_web_config", lambda: {"backend": "ddgs"})
        monkeypatch.setattr(web_tools, "_ddgs_package_importable", lambda: True)
        monkeypatch.setattr(web_tools, "_is_tool_gateway_ready", lambda: False)
+        monkeypatch.setattr(web_tools, "is_safe_url", lambda url: True)
        monkeypatch.setattr("tools.interrupt.is_interrupted", lambda: False, raising=False)

        result_str = asyncio.get_event_loop().run_until_complete(
@ -235,6 +249,8 @@ class TestDDGSSearchOnlyErrors:
        monkeypatch.setattr(web_tools, "_ddgs_package_importable", lambda: True)
        monkeypatch.setattr(web_tools, "_is_tool_gateway_ready", lambda: False)
        monkeypatch.setattr(web_tools, "check_firecrawl_api_key", lambda: False)
+        monkeypatch.setattr(web_tools, "is_safe_url", lambda url: True)
+        monkeypatch.setattr(web_tools, "check_website_access", lambda url: None)
        monkeypatch.setattr("tools.interrupt.is_interrupted", lambda: False, raising=False)

        result_str = asyncio.get_event_loop().run_until_complete(
--- a/tests/tools/test_web_providers_searxng.py
+++ b/tests/tools/test_web_providers_searxng.py
@ -17,6 +17,8 @@ from unittest.mock import MagicMock, patch

 import pytest

+from tests.tools.conftest import register_all_web_providers
+

 # ---------------------------------------------------------------------------
 # SearXNGWebSearchProvider unit tests
@ -301,6 +303,15 @@ class TestCheckWebApiKey:
 class TestSearXNGOnlyExtractCrawlErrors:
    """When searxng is the active backend, extract/crawl must return clear errors."""

+    _register_providers = staticmethod(register_all_web_providers)
+
+    @pytest.fixture(autouse=True)
+    def _populate_web_registry(self):
+        self._register_providers()
+        yield
+        from agent.web_search_registry import _reset_for_tests
+        _reset_for_tests()
+
    def test_web_crawl_searxng_returns_clear_error(self, monkeypatch):
        import asyncio
        from tools import web_tools
@ -309,6 +320,8 @@ class TestSearXNGOnlyExtractCrawlErrors:
        monkeypatch.setenv("SEARXNG_URL", "http://localhost:8080")
        monkeypatch.setattr(web_tools, "_is_tool_gateway_ready", lambda: False)
        monkeypatch.setattr(web_tools, "check_firecrawl_api_key", lambda: False)
+        monkeypatch.setattr(web_tools, "is_safe_url", lambda url: True)
+        monkeypatch.setattr(web_tools, "check_website_access", lambda url: None)
        monkeypatch.setattr("tools.interrupt.is_interrupted", lambda: False, raising=False)

        import json
@ -326,6 +339,7 @@ class TestSearXNGOnlyExtractCrawlErrors:
        monkeypatch.setattr(web_tools, "_load_web_config", lambda: {"backend": "searxng"})
        monkeypatch.setenv("SEARXNG_URL", "http://localhost:8080")
        monkeypatch.setattr(web_tools, "_is_tool_gateway_ready", lambda: False)
+        monkeypatch.setattr(web_tools, "is_safe_url", lambda url: True)
        monkeypatch.setattr("tools.interrupt.is_interrupted", lambda: False, raising=False)

        import json
--- a/tests/tools/test_web_tools_tavily.py
+++ b/tests/tools/test_web_tools_tavily.py
@ -13,6 +13,8 @@ import asyncio
 import pytest
 from unittest.mock import patch, MagicMock

+from tests.tools.conftest import register_all_web_providers
+

 # ─── _tavily_request ─────────────────────────────────────────────────────────

@ -163,6 +165,15 @@ class TestNormalizeTavilyDocuments:
 class TestWebSearchTavily:
    """Test web_search_tool dispatch to Tavily."""

+    _register_providers = staticmethod(register_all_web_providers)
+
+    @pytest.fixture(autouse=True)
+    def _populate_web_registry(self):
+        self._register_providers()
+        yield
+        from agent.web_search_registry import _reset_for_tests
+        _reset_for_tests()
+
    def test_search_dispatches_to_tavily(self):
        mock_response = MagicMock()
        mock_response.json.return_value = {
@ -186,6 +197,15 @@ class TestWebSearchTavily:
 class TestWebExtractTavily:
    """Test web_extract_tool dispatch to Tavily."""

+    _register_providers = staticmethod(register_all_web_providers)
+
+    @pytest.fixture(autouse=True)
+    def _populate_web_registry(self):
+        self._register_providers()
+        yield
+        from agent.web_search_registry import _reset_for_tests
+        _reset_for_tests()
+
    def test_extract_dispatches_to_tavily(self):
        mock_response = MagicMock()
        mock_response.json.return_value = {
@ -211,6 +231,15 @@ class TestWebExtractTavily:
 class TestWebCrawlTavily:
    """Test web_crawl_tool dispatch to Tavily."""

+    _register_providers = staticmethod(register_all_web_providers)
+
+    @pytest.fixture(autouse=True)
+    def _populate_web_registry(self):
+        self._register_providers()
+        yield
+        from agent.web_search_registry import _reset_for_tests
+        _reset_for_tests()
+
    def test_crawl_dispatches_to_tavily(self):
        mock_response = MagicMock()
        mock_response.json.return_value = {
--- a/tests/tools/test_website_policy.py
+++ b/tests/tools/test_website_policy.py
@ -4,6 +4,8 @@ from pathlib import Path
 import pytest
 import yaml

+from tests.tools.conftest import register_all_web_providers
+
 from tools.website_policy import WebsitePolicyError, check_website_access, load_website_blocklist


@ -347,40 +349,191 @@ def test_browser_navigate_allows_when_shared_file_missing(monkeypatch, tmp_path)
    assert result is None


-@pytest.mark.asyncio
-async def test_web_extract_short_circuits_blocked_url(monkeypatch):
-    from tools import web_tools
-    from plugins.web.firecrawl import provider as firecrawl_provider
+class TestWebToolPolicy:
+    """Tests that exercise web_extract_tool / web_crawl_tool with website-policy gates.

-    # Allow test URLs past SSRF check so website policy is what gets tested
-    monkeypatch.setattr(web_tools, "is_safe_url", lambda url: True)
-    # The per-URL website-policy gate moved into the firecrawl plugin's
-    # extract() during the web-provider migration. Patch it at the new
-    # location; the dispatcher-level gate (used by web_crawl_tool's
-    # pre-flight) still lives on tools.web_tools.
-    monkeypatch.setattr(
-        firecrawl_provider,
-        "check_website_access",
-        lambda url: {
-            "host": "blocked.test",
-            "rule": "blocked.test",
-            "source": "config",
-            "message": "Blocked by website policy",
-        },
-    )
-    monkeypatch.setattr(
-        firecrawl_provider,
-        "_get_firecrawl_client",
-        lambda: pytest.fail("firecrawl should not run for blocked URL"),
-    )
-    monkeypatch.setattr("tools.interrupt.is_interrupted", lambda: False)
-    # Force the firecrawl plugin to be the active extract provider.
-    monkeypatch.setenv("FIRECRAWL_API_KEY", "fake-key")
+    These tests need the bundled web providers to be registered in the
+    agent.web_search_registry so the tool dispatchers can find an active
+    provider.  Without registration, the tools return an error dict that
+    lacks a ``results`` key, causing ``KeyError``.
+    """

-    result = json.loads(await web_tools.web_extract_tool(["https://blocked.test"], use_llm_processing=False))
+    _register_providers = staticmethod(register_all_web_providers)

-    assert result["results"][0]["url"] == "https://blocked.test"
-    assert "Blocked by website policy" in result["results"][0]["error"]
+    @pytest.fixture(autouse=True)
+    def _populate_web_registry(self):
+        self._register_providers()
+        yield
+        from agent.web_search_registry import _reset_for_tests
+        _reset_for_tests()
+
+    @pytest.mark.asyncio
+    async def test_web_extract_short_circuits_blocked_url(self, monkeypatch):
+        from tools import web_tools
+        from plugins.web.firecrawl import provider as firecrawl_provider
+
+        # Allow test URLs past SSRF check so website policy is what gets tested
+        monkeypatch.setattr(web_tools, "is_safe_url", lambda url: True)
+        # The per-URL website-policy gate moved into the firecrawl plugin's
+        # extract() during the web-provider migration. Patch it at the new
+        # location; the dispatcher-level gate (used by web_crawl_tool's
+        # pre-flight) still lives on tools.web_tools.
+        monkeypatch.setattr(
+            firecrawl_provider,
+            "check_website_access",
+            lambda url: {
+                "host": "blocked.test",
+                "rule": "blocked.test",
+                "source": "config",
+                "message": "Blocked by website policy",
+            },
+        )
+        monkeypatch.setattr(
+            firecrawl_provider,
+            "_get_firecrawl_client",
+            lambda: pytest.fail("firecrawl should not run for blocked URL"),
+        )
+        monkeypatch.setattr("tools.interrupt.is_interrupted", lambda: False)
+        # Force the firecrawl plugin to be the active extract provider.
+        monkeypatch.setenv("FIRECRAWL_API_KEY", "fake-key")
+
+        result = json.loads(await web_tools.web_extract_tool(["https://blocked.test"], use_llm_processing=False))
+
+        assert result["results"][0]["url"] == "https://blocked.test"
+        assert "Blocked by website policy" in result["results"][0]["error"]
+
+    @pytest.mark.asyncio
+    async def test_web_extract_blocks_redirected_final_url(self, monkeypatch):
+        from tools import web_tools
+        from plugins.web.firecrawl import provider as firecrawl_provider
+
+        # Allow test URLs past SSRF check so website policy is what gets tested
+        monkeypatch.setattr(web_tools, "is_safe_url", lambda url: True)
+
+        def fake_check(url):
+            if url == "https://allowed.test":
+                return None
+            if url == "https://blocked.test/final":
+                return {
+                    "host": "blocked.test",
+                    "rule": "blocked.test",
+                    "source": "config",
+                    "message": "Blocked by website policy",
+                }
+            pytest.fail(f"unexpected URL checked: {url}")
+
+        class FakeFirecrawlClient:
+            def scrape(self, url, formats):
+                return {
+                    "markdown": "secret content",
+                    "metadata": {
+                        "title": "Redirected",
+                        "sourceURL": "https://blocked.test/final",
+                    },
+                }
+
+        # After the web-provider migration, the per-URL gate + firecrawl client
+        # live in the plugin. Patch both at the plugin location.
+        monkeypatch.setattr(firecrawl_provider, "check_website_access", fake_check)
+        monkeypatch.setattr(firecrawl_provider, "_get_firecrawl_client", lambda: FakeFirecrawlClient())
+        monkeypatch.setattr("tools.interrupt.is_interrupted", lambda: False)
+        monkeypatch.setenv("FIRECRAWL_API_KEY", "fake-key")
+
+        result = json.loads(await web_tools.web_extract_tool(["https://allowed.test"], use_llm_processing=False))
+
+        assert result["results"][0]["url"] == "https://blocked.test/final"
+        assert result["results"][0]["content"] == ""
+        assert result["results"][0]["blocked_by_policy"]["rule"] == "blocked.test"
+
+    @pytest.mark.asyncio
+    async def test_web_crawl_short_circuits_blocked_url(self, monkeypatch):
+        from tools import web_tools
+
+        # web_crawl_tool checks for Firecrawl env before website policy
+        monkeypatch.setenv("FIRECRAWL_API_KEY", "fake-key")
+        # Allow test URLs past SSRF check so website policy is what gets tested
+        monkeypatch.setattr(web_tools, "is_safe_url", lambda url: True)
+        # The dispatcher-level (seed-URL) policy gate still lives on web_tools.
+        # No per-page gate runs in this test because the dispatcher returns
+        # immediately when the seed is blocked, before delegating to the plugin.
+        monkeypatch.setattr(
+            web_tools,
+            "check_website_access",
+            lambda url: {
+                "host": "blocked.test",
+                "rule": "blocked.test",
+                "source": "config",
+                "message": "Blocked by website policy",
+            },
+        )
+        # If the dispatcher ever reaches the firecrawl plugin's crawl(), the test
+        # fails — pin the plugin module's client lookup so we'd notice.
+        from plugins.web.firecrawl import provider as firecrawl_provider
+        monkeypatch.setattr(
+            firecrawl_provider,
+            "_get_firecrawl_client",
+            lambda: pytest.fail("firecrawl plugin should not run for blocked crawl URL"),
+        )
+        monkeypatch.setattr("tools.interrupt.is_interrupted", lambda: False)
+
+        result = json.loads(await web_tools.web_crawl_tool("https://blocked.test", use_llm_processing=False))
+
+        assert result["results"][0]["url"] == "https://blocked.test"
+        assert result["results"][0]["blocked_by_policy"]["rule"] == "blocked.test"
+
+    @pytest.mark.asyncio
+    async def test_web_crawl_blocks_redirected_final_url(self, monkeypatch):
+        from tools import web_tools
+        from plugins.web.firecrawl import provider as firecrawl_provider
+
+        # Force the firecrawl plugin to be the active crawl provider.
+        monkeypatch.setenv("FIRECRAWL_API_KEY", "fake-key")
+        # Allow test URLs past SSRF check so website policy is what gets tested
+        monkeypatch.setattr(web_tools, "is_safe_url", lambda url: True)
+
+        def fake_check(url):
+            # Dispatcher seed-URL gate (web_tools.check_website_access call)
+            # and plugin per-page gate (firecrawl_provider.check_website_access
+            # call) both flow through this single fake_check.
+            if url == "https://allowed.test":
+                return None
+            if url == "https://blocked.test/final":
+                return {
+                    "host": "blocked.test",
+                    "rule": "blocked.test",
+                    "source": "config",
+                    "message": "Blocked by website policy",
+                }
+            pytest.fail(f"unexpected URL checked: {url}")
+
+        class FakeCrawlClient:
+            def crawl(self, url, **kwargs):
+                return {
+                    "data": [
+                        {
+                            "markdown": "secret crawl content",
+                            "metadata": {
+                                "title": "Redirected crawl page",
+                                "sourceURL": "https://blocked.test/final",
+                            },
+                        }
+                    ]
+                }
+
+        # After PR #25182 follow-up: per-page policy gate lives in
+        # plugins.web.firecrawl.provider.crawl(). Patch the gate + client at
+        # the plugin location. The dispatcher-level (seed) gate also reads
+        # web_tools.check_website_access — patch both.
+        monkeypatch.setattr(web_tools, "check_website_access", fake_check)
+        monkeypatch.setattr(firecrawl_provider, "check_website_access", fake_check)
+        monkeypatch.setattr(firecrawl_provider, "_get_firecrawl_client", lambda: FakeCrawlClient())
+        monkeypatch.setattr("tools.interrupt.is_interrupted", lambda: False)
+
+        result = json.loads(await web_tools.web_crawl_tool("https://allowed.test", use_llm_processing=False))
+
+        assert result["results"][0]["content"] == ""
+        assert result["results"][0]["error"] == "Blocked by website policy"
+        assert result["results"][0]["blocked_by_policy"]["rule"] == "blocked.test"


 def test_check_website_access_fails_open_on_malformed_config(tmp_path, monkeypatch):
@ -400,139 +553,3 @@ def test_check_website_access_fails_open_on_malformed_config(tmp_path, monkeypat
    # With default path, errors are caught and fail open
    result = check_website_access("https://example.com")
    assert result is None  # allowed, not crashed
-
-
-@pytest.mark.asyncio
-async def test_web_extract_blocks_redirected_final_url(monkeypatch):
-    from tools import web_tools
-    from plugins.web.firecrawl import provider as firecrawl_provider
-
-    # Allow test URLs past SSRF check so website policy is what gets tested
-    monkeypatch.setattr(web_tools, "is_safe_url", lambda url: True)
-
-    def fake_check(url):
-        if url == "https://allowed.test":
-            return None
-        if url == "https://blocked.test/final":
-            return {
-                "host": "blocked.test",
-                "rule": "blocked.test",
-                "source": "config",
-                "message": "Blocked by website policy",
-            }
-        pytest.fail(f"unexpected URL checked: {url}")
-
-    class FakeFirecrawlClient:
-        def scrape(self, url, formats):
-            return {
-                "markdown": "secret content",
-                "metadata": {
-                    "title": "Redirected",
-                    "sourceURL": "https://blocked.test/final",
-                },
-            }
-
-    # After the web-provider migration, the per-URL gate + firecrawl client
-    # live in the plugin. Patch both at the plugin location.
-    monkeypatch.setattr(firecrawl_provider, "check_website_access", fake_check)
-    monkeypatch.setattr(firecrawl_provider, "_get_firecrawl_client", lambda: FakeFirecrawlClient())
-    monkeypatch.setattr("tools.interrupt.is_interrupted", lambda: False)
-    monkeypatch.setenv("FIRECRAWL_API_KEY", "fake-key")
-
-    result = json.loads(await web_tools.web_extract_tool(["https://allowed.test"], use_llm_processing=False))
-
-    assert result["results"][0]["url"] == "https://blocked.test/final"
-    assert result["results"][0]["content"] == ""
-    assert result["results"][0]["blocked_by_policy"]["rule"] == "blocked.test"
-
-
-@pytest.mark.asyncio
-async def test_web_crawl_short_circuits_blocked_url(monkeypatch):
-    from tools import web_tools
-
-    # web_crawl_tool checks for Firecrawl env before website policy
-    monkeypatch.setenv("FIRECRAWL_API_KEY", "fake-key")
-    # Allow test URLs past SSRF check so website policy is what gets tested
-    monkeypatch.setattr(web_tools, "is_safe_url", lambda url: True)
-    # The dispatcher-level (seed-URL) policy gate still lives on web_tools.
-    # No per-page gate runs in this test because the dispatcher returns
-    # immediately when the seed is blocked, before delegating to the plugin.
-    monkeypatch.setattr(
-        web_tools,
-        "check_website_access",
-        lambda url: {
-            "host": "blocked.test",
-            "rule": "blocked.test",
-            "source": "config",
-            "message": "Blocked by website policy",
-        },
-    )
-    # If the dispatcher ever reaches the firecrawl plugin's crawl(), the test
-    # fails — pin the plugin module's client lookup so we'd notice.
-    from plugins.web.firecrawl import provider as firecrawl_provider
-    monkeypatch.setattr(
-        firecrawl_provider,
-        "_get_firecrawl_client",
-        lambda: pytest.fail("firecrawl plugin should not run for blocked crawl URL"),
-    )
-    monkeypatch.setattr("tools.interrupt.is_interrupted", lambda: False)
-
-    result = json.loads(await web_tools.web_crawl_tool("https://blocked.test", use_llm_processing=False))
-
-    assert result["results"][0]["url"] == "https://blocked.test"
-    assert result["results"][0]["blocked_by_policy"]["rule"] == "blocked.test"
-
-
-@pytest.mark.asyncio
-async def test_web_crawl_blocks_redirected_final_url(monkeypatch):
-    from tools import web_tools
-    from plugins.web.firecrawl import provider as firecrawl_provider
-
-    # Force the firecrawl plugin to be the active crawl provider.
-    monkeypatch.setenv("FIRECRAWL_API_KEY", "fake-key")
-    # Allow test URLs past SSRF check so website policy is what gets tested
-    monkeypatch.setattr(web_tools, "is_safe_url", lambda url: True)
-
-    def fake_check(url):
-        # Dispatcher seed-URL gate (web_tools.check_website_access call)
-        # and plugin per-page gate (firecrawl_provider.check_website_access
-        # call) both flow through this single fake_check.
-        if url == "https://allowed.test":
-            return None
-        if url == "https://blocked.test/final":
-            return {
-                "host": "blocked.test",
-                "rule": "blocked.test",
-                "source": "config",
-                "message": "Blocked by website policy",
-            }
-        pytest.fail(f"unexpected URL checked: {url}")
-
-    class FakeCrawlClient:
-        def crawl(self, url, **kwargs):
-            return {
-                "data": [
-                    {
-                        "markdown": "secret crawl content",
-                        "metadata": {
-                            "title": "Redirected crawl page",
-                            "sourceURL": "https://blocked.test/final",
-                        },
-                    }
-                ]
-            }
-
-    # After PR #25182 follow-up: per-page policy gate lives in
-    # plugins.web.firecrawl.provider.crawl(). Patch the gate + client at
-    # the plugin location. The dispatcher-level (seed) gate also reads
-    # web_tools.check_website_access — patch both.
-    monkeypatch.setattr(web_tools, "check_website_access", fake_check)
-    monkeypatch.setattr(firecrawl_provider, "check_website_access", fake_check)
-    monkeypatch.setattr(firecrawl_provider, "_get_firecrawl_client", lambda: FakeCrawlClient())
-    monkeypatch.setattr("tools.interrupt.is_interrupted", lambda: False)
-
-    result = json.loads(await web_tools.web_crawl_tool("https://allowed.test", use_llm_processing=False))
-
-    assert result["results"][0]["content"] == ""
-    assert result["results"][0]["error"] == "Blocked by website policy"
-    assert result["results"][0]["blocked_by_policy"]["rule"] == "blocked.test"
--- a/tests/tools/test_write_deny.py
+++ b/tests/tools/test_write_deny.py
@ -1,8 +1,10 @@
 """Tests for _is_write_denied() — verifies deny list blocks sensitive paths on all platforms."""

 import os
+
 import pytest
 from pathlib import Path
+from unittest.mock import patch

 from tools.file_operations import _is_write_denied

@ -97,8 +99,22 @@ class TestWriteDenyPrefixes:
    def test_sudoers_d_prefix(self):
        assert _is_write_denied("/etc/sudoers.d/custom") is True

-    def test_systemd_prefix(self):
-        assert _is_write_denied("/etc/systemd/system/evil.service") is True
+    def test_systemd_prefix(self, tmp_path):
+        # On NixOS, /etc/systemd is a symlink into /nix/store, so
+        # realpath() resolves it to a store path that doesn't match
+        # the /etc/systemd/ prefix.  Build a real directory tree so
+        # realpath is a no-op and prefix matching works.
+        fake_etc = tmp_path / "etc" / "systemd" / "system"
+        fake_etc.mkdir(parents=True)
+        target = str(fake_etc / "evil.service")
+        # Patch the prefix builder to include our tmp_path prefix
+        import agent.file_safety as _fs
+        _orig = _fs.build_write_denied_prefixes
+        _extra_prefix = str(tmp_path / "etc" / "systemd") + os.sep
+        def _patched(home):
+            return _orig(home) + [_extra_prefix]
+        with patch.object(_fs, "build_write_denied_prefixes", _patched):
+            assert _is_write_denied(target) is True


 class TestWriteAllowed:
--- a/uv.lock
+++ b/uv.lock
@ -1261,15 +1261,6 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/e2/bc/7a34e904a415040ba626948d0b0a36a08cd073f12b13342578a68331be3c/exa_py-2.10.2-py3-none-any.whl", hash = "sha256:ecb2a7581f4b7a8aeb6b434acce1bbc40f92ed1d4126b2aa6029913acd904a47", size = 72248, upload-time = "2026-03-26T20:29:37.306Z" },
 ]

-[[package]]
-name = "execnet"
-version = "2.1.2"
-source = { registry = "https://pypi.org/simple" }
-sdist = { url = "https://files.pythonhosted.org/packages/bf/89/780e11f9588d9e7128a3f87788354c7946a9cbb1401ad38a48c4db9a4f07/execnet-2.1.2.tar.gz", hash = "sha256:63d83bfdd9a23e35b9c6a3261412324f964c2ec8dcd8d3c6916ee9373e0befcd", size = 166622, upload-time = "2025-11-12T09:56:37.75Z" }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/ab/84/02fc1827e8cdded4aa65baef11296a9bbe595c474f0d6d758af082d849fd/execnet-2.1.2-py3-none-any.whl", hash = "sha256:67fba928dd5a544b783f6056f449e5e3931a5c378b128bc18501f7ea79e296ec", size = 40708, upload-time = "2025-11-12T09:56:36.333Z" },
-]
-
 [[package]]
 name = "fal-client"
 version = "0.13.1"
@ -1635,9 +1626,7 @@ all = [
    { name = "ptyprocess", marker = "sys_platform != 'win32'" },
    { name = "pytest" },
    { name = "pytest-asyncio" },
-    { name = "pytest-split" },
    { name = "pytest-timeout" },
-    { name = "pytest-xdist" },
    { name = "pywinpty", marker = "sys_platform == 'win32'" },
    { name = "ruff" },
    { name = "simple-term-menu" },
@ -1668,9 +1657,7 @@ dev = [
    { name = "mcp" },
    { name = "pytest" },
    { name = "pytest-asyncio" },
-    { name = "pytest-split" },
    { name = "pytest-timeout" },
-    { name = "pytest-xdist" },
    { name = "ruff" },
    { name = "ty" },
 ]
@ -1863,9 +1850,7 @@ requires-dist = [
    { name = "pyjwt", extras = ["crypto"], specifier = "==2.12.1" },
    { name = "pytest", marker = "extra == 'dev'", specifier = "==9.0.2" },
    { name = "pytest-asyncio", marker = "extra == 'dev'", specifier = "==1.3.0" },
-    { name = "pytest-split", marker = "extra == 'dev'", specifier = "==0.11.0" },
    { name = "pytest-timeout", marker = "extra == 'dev'", specifier = "==2.4.0" },
-    { name = "pytest-xdist", marker = "extra == 'dev'", specifier = "==3.8.0" },
    { name = "python-dotenv", specifier = "==1.2.2" },
    { name = "python-telegram-bot", extras = ["webhooks"], marker = "extra == 'messaging'", specifier = "==22.6" },
    { name = "python-telegram-bot", extras = ["webhooks"], marker = "extra == 'termux'", specifier = "==22.6" },
@ -3482,18 +3467,6 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/e5/35/f8b19922b6a25bc0880171a2f1a003eaeb93657475193ab516fd87cac9da/pytest_asyncio-1.3.0-py3-none-any.whl", hash = "sha256:611e26147c7f77640e6d0a92a38ed17c3e9848063698d5c93d5aa7aa11cebff5", size = 15075, upload-time = "2025-11-10T16:07:45.537Z" },
 ]

-[[package]]
-name = "pytest-split"
-version = "0.11.0"
-source = { registry = "https://pypi.org/simple" }
-dependencies = [
-    { name = "pytest" },
-]
-sdist = { url = "https://files.pythonhosted.org/packages/2f/16/8af4c5f2ceb3640bb1f78dfdf5c184556b10dfe9369feaaad7ff1c13f329/pytest_split-0.11.0.tar.gz", hash = "sha256:8ebdb29cc72cc962e8eb1ec07db1eeb98ab25e215ed8e3216f6b9fc7ce0ec2b5", size = 13421, upload-time = "2026-02-03T09:14:31.469Z" }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/ae/a1/d4423657caaa8be9b31e491592b49cebdcfd434d3e74512ce71f6ec39905/pytest_split-0.11.0-py3-none-any.whl", hash = "sha256:899d7c0f5730da91e2daf283860eb73b503259cb416851a65599368849c7f382", size = 11911, upload-time = "2026-02-03T09:14:33.708Z" },
-]
-
 [[package]]
 name = "pytest-timeout"
 version = "2.4.0"
@ -3506,19 +3479,6 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/fa/b6/3127540ecdf1464a00e5a01ee60a1b09175f6913f0644ac748494d9c4b21/pytest_timeout-2.4.0-py3-none-any.whl", hash = "sha256:c42667e5cdadb151aeb5b26d114aff6bdf5a907f176a007a30b940d3d865b5c2", size = 14382, upload-time = "2025-05-05T19:44:33.502Z" },
 ]

-[[package]]
-name = "pytest-xdist"
-version = "3.8.0"
-source = { registry = "https://pypi.org/simple" }
-dependencies = [
-    { name = "execnet" },
-    { name = "pytest" },
-]
-sdist = { url = "https://files.pythonhosted.org/packages/78/b4/439b179d1ff526791eb921115fca8e44e596a13efeda518b9d845a619450/pytest_xdist-3.8.0.tar.gz", hash = "sha256:7e578125ec9bc6050861aa93f2d59f1d8d085595d6551c2c90b6f4fad8d3a9f1", size = 88069, upload-time = "2025-07-01T13:30:59.346Z" }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/ca/31/d4e37e9e550c2b92a9cbc2e4d0b7420a27224968580b5a447f420847c975/pytest_xdist-3.8.0-py3-none-any.whl", hash = "sha256:202ca578cfeb7370784a8c33d6d05bc6e13b4f25b5053c30a152269fd10f0b88", size = 46396, upload-time = "2025-07-01T13:30:56.632Z" },
-]
-
 [[package]]
 name = "python-dateutil"
 version = "2.9.0.post0"