16 commits

Author SHA1 Message Date
Teknium
b508d4296e
test(ci): raise per-file timeout 140s → 300s to stop false timeouts (#54143)
* test(ci): raise per-file timeout 140s to 300s to stop false timeouts

The per-file parallel runner caps each test-file subprocess at a flat
wall-clock budget. Combined with per-test subprocess isolation (a fresh
Python process per test), a large-collection file pays N x (interpreter
startup + import) of overhead before any test logic runs. That overhead
dilates under load on shared CI runners, so a file that finishes in
~100s on a quiet box can blow the old 140s cap purely from scheduling
jitter, surfacing as a false 'no tests ran' timeout (rc=124) with zero
actual test failures.

Raise the default to 300s (5 min). The Docker build matrix jobs already
take 7-10 min, so this headroom costs nothing on total CI wall time
while still bounding a genuinely hung file.

* docs: add infographic for CI per-file timeout bump
2026-06-28 02:41:07 -07:00
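A minimal sketch of the flat per-file wall-clock cap described in b508d4296e, assuming the runner wraps each test-file subprocess in a timeout and reports rc=124 when the cap is hit. The names `FILE_TIMEOUT_S`, `TIMEOUT_RC`, and `run_with_cap` are hypothetical; only the 300s default and the rc=124 timeout code come from the commit message. Unlike the real runner, this sketch does not kill the whole process tree.

```python
import subprocess
import sys

FILE_TIMEOUT_S = 300  # default from the commit; the constant name is hypothetical
TIMEOUT_RC = 124      # timeout exit code quoted in the commit message


def run_with_cap(cmd, timeout_s=FILE_TIMEOUT_S):
    """Run cmd and return (rc, output); rc is TIMEOUT_RC if the cap is hit."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired as exc:
        # Partial output may be None or bytes, depending on the platform.
        out = exc.stdout or ""
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
        return TIMEOUT_RC, out
    return proc.returncode, proc.stdout


# A fast command finishes well inside the cap.
quick_rc, quick_out = run_with_cap([sys.executable, "-c", "print('ok')"])

# A hung command is bounded by the cap and surfaces as rc=124.
hung_rc, _ = run_with_cap(
    [sys.executable, "-c", "import time; time.sleep(30)"], timeout_s=0.5
)
```

The cap is flat per file, so a file with many tests pays the same budget as a small one; that is why the commit raises the default rather than scaling it.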
Teknium
2523917680
fix(tests): bare pytest flags pass through run_tests.sh without a '--' separator (#54008)
The parallel runner only forwarded pytest args after a literal '--', so a
bare 'scripts/run_tests.sh tests/foo.py -q' (or -v/-x/-k/--tb=long) errored
out with 'unrecognized arguments'. This contradicted the docstring's
promise that common pytest flags pass through, and forced a retry on every
run that used pytest muscle-memory.

Now any token starting with '-' that isn't one of the runner's own options
(-j/--jobs, --paths, --slice, --file-timeout, --generate-slices, --files,
--include-integration) is routed to each per-file pytest invocation
automatically. Value-taking flags given space-separated (-k expr, -m mark,
-p plugin, -o name=val, etc.) keep their value instead of having it stolen
by positional-path discovery. The explicit '--' separator still works and
stacks with bare flags.

- scripts/run_tests_parallel.py: argv splitter routes bare unknown flags to
  pytest; value-flag lookahead; updated docstring.
- scripts/run_tests.sh: usage comment reflects bare-flag passthrough.
- tests/test_run_tests_parallel.py: 4 behavior-contract tests (bare -q runs,
  -k keeps its value/filters, '--' still works, positional path stays a root).
2026-06-27 22:43:26 -07:00
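The routing rule from 2523917680 can be sketched as a small argv splitter. This is an illustration under assumptions, not the code in scripts/run_tests_parallel.py: the function name `split_argv` is hypothetical, and which runner options take a value (versus being plain switches) is guessed here. The option names and the pytest value-taking flags (-k, -m, -p, -o) are the ones listed in the commit message.

```python
# Assumption: these runner options consume the next token as their value.
RUNNER_VALUE_OPTS = {"-j", "--jobs", "--paths", "--slice", "--file-timeout", "--files"}
# Assumption: these runner options are plain switches.
RUNNER_BOOL_OPTS = {"--generate-slices", "--include-integration"}
# pytest flags that take a space-separated value (subset named in the commit).
PYTEST_VALUE_FLAGS = {"-k", "-m", "-p", "-o"}


def split_argv(argv):
    """Split argv into (runner_args, pytest_args, positional_paths)."""
    runner, pytest_args, paths = [], [], []
    i = 0
    while i < len(argv):
        tok = argv[i]
        if tok == "--":
            # Explicit separator: everything after it goes to pytest.
            pytest_args.extend(argv[i + 1:])
            break
        name = tok.split("=", 1)[0]
        if name in RUNNER_VALUE_OPTS:
            runner.append(tok)
            if "=" not in tok and i + 1 < len(argv):
                runner.append(argv[i + 1])
                i += 1
        elif name in RUNNER_BOOL_OPTS:
            runner.append(tok)
        elif tok.startswith("-"):
            # Unknown bare flag: route to pytest.
            pytest_args.append(tok)
            # Lookahead keeps the value from being mistaken for a path.
            if tok in PYTEST_VALUE_FLAGS and i + 1 < len(argv):
                pytest_args.append(argv[i + 1])
                i += 1
        else:
            paths.append(tok)
        i += 1
    return runner, pytest_args, paths
```

With this rule, `split_argv(["tests/foo.py", "-q"])` keeps `tests/foo.py` as a positional path and sends `-q` to pytest, while a trailing `--` section stacks with any bare flags seen earlier.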
ethernet
dd0e4ab81a change(ci): slice files in matrix job
avoid duplicating work, avoid file discovery on each job
2026-06-26 19:15:18 -07:00
ethernet
1a75387fa8 change(ci): log json decode error in durations 2026-06-26 19:15:18 -07:00
ethernet
707ae6e623 change(tests): don't count with pytest collect
it's way too slow. just grep files lol
2026-06-26 19:15:18 -07:00
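A rough sketch of what "just grep files" in 707ae6e623 could look like: estimating the test count with a regular expression instead of running pytest collection. The regex, the function name `estimate_test_count`, and the sample source are assumptions for illustration; the commit does not show the actual pattern. The result is only an estimate, since parametrized tests expand to more than one collected item.

```python
import re

# Matches sync and async test functions or methods at any indentation.
TEST_DEF_RE = re.compile(r"^\s*(?:async\s+)?def\s+test_\w*\s*\(", re.MULTILINE)


def estimate_test_count(source: str) -> int:
    """Count 'def test_...' definitions without importing or collecting."""
    return len(TEST_DEF_RE.findall(source))


SAMPLE = '''
def helper():
    pass

def test_one():
    pass

class TestThing:
    async def test_two(self):
        pass

    def test_three(self, x):
        pass
'''

count = estimate_test_count(SAMPLE)  # helper() is not counted
```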
ethernet
9a861cd0ab change(tests): don't pass pytest args when counting tests 2026-06-26 19:15:18 -07:00
ethernet
fb1dd1bf91 change(ci): docker-publish.yml -> docker.yml 2026-06-26 19:15:18 -07:00
ethernet
4d68984ec7 fix(tests): remove no-longer-needed forensics 2026-06-12 13:42:42 -04:00
ethernet
2f9d18711f fix(ci): remove pytest-timeout, use per-file timeout only
fix(ci): write a new cache for test durations every time
change(ci): rip out error 4 retries because we found the real bug
2026-06-12 13:42:42 -04:00
Teknium
07ac185904
fix(ci): exit-4 forensics for vanishing test files in run_tests_parallel.py (#43646)
* fix(ci): append filesystem forensics when a per-file pytest run exhausts exit-4 retries

A PR-added test file (tests/test_iron_proxy.py, PR #30179) repeatedly
failed exactly one CI shard with 'ERROR: file or directory not found'
across 4 runs (including a fresh merge SHA on fresh runners), while the
identical slice passes locally against the same merge commit and a
tree-integrity watcher confirms no sibling test mutates the repo. Three
unrelated branches showed the same one-shard signature the same day.

We currently cannot attribute these because the log only carries
pytest's exit-4 line. This adds a forensics block to the captured
output when exit-4 survives the retry loop:

- does the file exist NOW (post-retries)
- parent dir entry count + similarly-named entries
- git status --porcelain dirty-entry count + first 10 entries

Zero behavior change: rc stays 4, retries unchanged, forensics wrapped
in a broad try/except so they can never mask the failure.

Two new tests cover the exhausted-retries and genuinely-missing paths.

* chore: drop the two forensics tests — ship the runner change only
2026-06-10 10:04:17 -07:00
Teknium
f082b4ec5c
fix(ci): make parallel runner's exit-4 retry robust for newly-added test files (#42994)
The per-file test runner re-runs a file once when pytest exits 4 ("file or
directory not found") while the file exists on disk — a transient seen on
loaded shared CI runners where the planner collects a file (--collect-only
counts its tests) but the per-file subprocess fails to stat it moments later.

A single immediate retry could land in the same brief high-load window and
fail again, and the retry was gated on one Path.exists() check that can itself
be a flaky stat under that load — so a freshly-added test file that LPT pins to
one shard would deterministically turn that shard red on every run (no actual test
failure; the file just never executes).

- Extract the subprocess spawn/communicate/process-tree-kill logic into a
  shared _spawn_pytest_once() helper (removes ~90 lines of duplication between
  the primary run and the retry).
- Replace the single-shot retry with a bounded backoff loop
  (_EXIT4_RETRY_ATTEMPTS, escalating sleep) that re-runs while the file is
  present on disk.
- Add _file_present() which re-checks existence across a few spaced stats, so a
  single flaky negative stat doesn't wrongly conclude the file is missing. A
  genuinely-missing file (typo/deleted) still fails fast — exit 4 is not
  swallowed when the file truly does not exist.
- Tests: transient-then-pass recovery, genuinely-missing fails fast with no
  retry, give-up after max attempts, and _file_present transient/missing cases.
2026-06-09 21:39:09 -07:00
teknium1
754154a9c2 fix(tests): retry per-file pytest subprocess once on exit-4 when the file exists
The parallel test runner sharded a present, tracked test file
(tests/plugins/platforms/photon/test_inbound.py) onto a slice that then
reported 'file or directory not found' (pytest exit 4) at exec time —
even though the planner had just enumerated the file via --collect-only
('5269 passed, 0 failed' in the same run). On loaded shared CI runners
the per-file subprocess can fail to stat a file the planner already saw;
the deterministic LPT slicer then reproduces it on every rerun because
the same file set lands on the same shard.

Fix: when a per-file run exits 4 AND the file still exists on disk, retry
the subprocess once before surfacing it as a hard failure. This kills the
shard-flake class for everyone, not just this PR.

Does NOT widen the exit-5-is-pass rule — exit 4 on a genuinely missing
file still fails (verified). Retry reuses the same pgroup-kill cleanup as
the primary run so no grandchildren orphan.

Validation: photon dir runs green through scripts/run_tests_parallel.py;
unit-level negative case confirms a nonexistent file still returns rc=4.
2026-06-08 13:38:30 -07:00
Ben
da8b2e95fd ci(docker): run tests/docker/ in build-amd64 against the freshly-built image
The new tests/docker/ suite (added by this PR) was being picked up by the
sharded pytest matrix in tests.yml, where its session-scoped `built_image`
fixture issued a 3-7min `docker build` under tests/docker/conftest.py's
180s pytest-timeout cap. Every test in the directory failed in fixture
setup across all 6 shards.

Fix the suite so it actually runs (not skips):

1. Wire the docker tests into docker-publish.yml's build-amd64 job, right
   after the existing smoke test. The image is already loaded into the
   local daemon as `nousresearch/hermes-agent:test`; set
   HERMES_TEST_IMAGE to that and the fixture's pre-built-image branch
   short-circuits the rebuild. 21 tests run in ~90s locally against a
   prebuilt image, no rebuild cost on top of the existing build step.

2. Exclude tests/docker/ from scripts/run_tests_parallel.py's default
   discovery so the sharded matrix in tests.yml stops trying to build
   the image. Explicit positional paths (`pytest tests/docker/` or
   `scripts/run_tests.sh tests/docker/`) still pick the suite up — the
   skip rule honors directory-level user intent, matching the existing
   per-file override pattern.

The dedicated docker-tests step runs on every PR that touches docker
code (the existing path filters on docker-publish.yml already cover
`tests/docker/**` via `**/*.py`), so the suite gates real changes.

(cherry picked from commit 4c481860ce)
2026-05-25 12:40:57 +10:00
teknium1
5cbb132c1d
fix(ci): exclude tests/docker/ from regular test shards; pin read_text encoding
Two CI follow-ups to @benbarclay's #30136 salvage:

1. scripts/run_tests_parallel.py — add 'docker' to _SKIP_PARTS so
   the new tests/docker/ harness doesn't run in the regular test (N)
   matrix. The harness builds the real Dockerfile in a session
   fixture, which can exceed pytest-timeout's 180s ceiling on
   ubuntu-latest where Docker IS available — it surfaced as 6
   identical setup-timeout failures across slices 1–6 on the first
   CI run.

   The docker harness has its own dedicated runner via
   .github/actions/hermes-smoke-test (added in #30136) plus the
   docker-lint workflow. Same treatment as tests/integration/ and
   tests/e2e/ — runs separately, not in the main shards.

2. hermes_cli/service_manager.py — pin encoding='utf-8' on the
   /proc/1/comm read_text call. Ruff PLW1514 enforcement rolled in
   between Ben's last push and the salvage; pure ruff-fix, no
   behavior change.
2026-05-24 18:23:13 -07:00
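One way the discovery exclusion in 5cbb132c1d and da8b2e95fd might work is sketched below. `_SKIP_PARTS` is named in the commit, but its exact contents, the `discover` function, and the `test_*.py` glob are assumptions. The sketch checks only directory names below the root the caller passed, which is one possible way to honor an explicit `tests/docker/` path while still skipping it during default discovery.

```python
from pathlib import Path

# Name from the commit; contents assumed from the directories it mentions.
_SKIP_PARTS = {"integration", "e2e", "docker"}


def discover(root):
    """Return test files under root, skipping excluded subdirectories."""
    root = Path(root)
    files = []
    for path in sorted(root.rglob("test_*.py")):
        # Only directories below the given root are checked, so passing
        # tests/docker/ explicitly still picks the suite up.
        sub_dirs = path.relative_to(root).parts[:-1]
        if any(part in _SKIP_PARTS for part in sub_dirs):
            continue
        files.append(path)
    return files
```

Under these assumptions, `discover("tests")` omits anything under `tests/docker/`, while `discover("tests/docker")` returns that suite's files.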
ethernet
b689624aee feat(ci): 4-way matrix slicing with LPT duration-balanced distribution
run_tests_parallel.py:
  - --slice I/N flag (also HERMES_TEST_SLICE env var) runs only the
    I-th slice of N, distributing files across slices by cached
    duration using LPT (Longest Processing Time first) greedy
    algorithm so each slice gets roughly equal wall time
  - Duration cache (test_durations.json): maps relative file paths to
    last-observed subprocess wall time. _save_durations merges with
    existing cache so entries from other slices are preserved.
  - Per-file subprocess timing in progress output + end-of-run
    distribution summary (percentiles, top-10 slowest, <1s/<2s counts)
  - Unknown files default to 2.0s estimate (~P50), spread evenly by LPT

.github/workflows/tests.yml:
  - Matrix strategy: slice [1, 2, 3, 4] with fail-fast: false
  - Each slice restores duration cache from main (stable key, no SHA),
    runs its portion, uploads per-slice durations as artifacts
  - save-durations job (main only, if: always()) downloads all 4
    artifacts, merges into single cache entry for future PRs
  - Timeout reduced from 60min to 30min per slice (~1/4 the work)

Cache design:
  - Stable key (test-durations) not keyed by commit SHA — durations
    are about files, not commits, and SHA-keyed caches miss on every
    new commit and on PR merge commits
  - actions/cache scoping: main's cache is visible to all PRs targeting
    main; feature branches without a cache still work (default 2.0s)
  - No dotfile prefix (upload-artifact v7 skips hidden files)
2026-05-22 19:46:18 -07:00
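The LPT (Longest Processing Time first) distribution in b689624aee can be illustrated with a short greedy sketch: sort files by cached duration, longest first, and always hand the next file to the currently lightest slice. The function names, the tie-breaking by file name, and the 1-indexed reading of `I/N` are assumptions; the 2.0s default for unknown files comes from the commit message.

```python
import heapq

DEFAULT_DURATION_S = 2.0  # estimate for files missing from the cache (per the commit)


def lpt_slices(files, durations, n_slices):
    """Greedy LPT: assign each file, longest first, to the lightest slice."""
    heap = [(0.0, i) for i in range(n_slices)]  # (accumulated load, slice index)
    heapq.heapify(heap)
    slices = [[] for _ in range(n_slices)]
    # Sort by descending duration; the file name breaks ties deterministically.
    ordered = sorted(files, key=lambda f: (-durations.get(f, DEFAULT_DURATION_S), f))
    for f in ordered:
        load, idx = heapq.heappop(heap)
        slices[idx].append(f)
        heapq.heappush(heap, (load + durations.get(f, DEFAULT_DURATION_S), idx))
    return slices


def pick_slice(files, durations, spec):
    """Return the files for a slice spec such as '2/4' (assumed 1-indexed)."""
    index, total = (int(part) for part in spec.split("/"))
    return lpt_slices(files, durations, total)[index - 1]


durations = {"a": 10.0, "b": 8.0, "c": 6.0, "d": 4.0}
plan = lpt_slices(["a", "b", "c", "d", "e"], durations, 2)
# plan == [["a", "d", "e"], ["b", "c"]]: loads of 16.0s and 14.0s
```

Because the assignment depends only on the file set and the cached durations, every matrix job computes the same plan independently, which also explains why a problematic file lands on the same shard on every rerun.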
ethernet
48be2e0e4d
test: use subprocesses for each test file (#29016)
* ci(tests): install ripgrep from prebuilt tarball instead of apt

apt-get update + install of ripgrep takes ~4 min on the GHA Ubuntu
runners (the apt-get update against archive.ubuntu.com is the slow
part; ripgrep itself is small). Switching to the upstream musl
binary tarball cuts the step to a few seconds.

- Pinned to ripgrep 15.1.0 with sha256 verification (same hash as
  published in the releases sha256 sidecar file).
- Drops the `rg` binary into /usr/local/bin so it is on PATH for
  every subsequent step without GITHUB_PATH manipulation.
- Applied to both the test and e2e jobs in tests.yml.

* fix(cli): compile syntax check to tempdir, not source __pycache__

`_validate_critical_files_syntax` runs `py_compile.compile()` on each
critical bootstrap file after a successful `git pull`. The default
`py_compile` writes the resulting `.pyc` next to the source under
`__pycache__/`, which causes two real problems:

1. Parallel test workers walking the same source tree (e.g. running
   the suite under per-file process isolation) can race against each
   other on the `__pycache__` write — manifests as flaky 'directory
   not empty' errors during teardown.
2. In production, the post-pull syntax check leaves a `.pyc` behind
   that the next interpreter run might pick up — fine when the
   interpreter version matches, sketchy if it doesn't.

Fix: write the compiled output to a `tempfile.TemporaryDirectory()`
that's discarded on function exit. We only care about the compile-or-not
signal, not the artifact.

* test(runner): per-file process isolation, drop manual state reset + xdist

Replace fragile manual _reset_module_state test fixtures with robust
per-file subprocess isolation. Each test file runs in a fresh
`python -m pytest <file>` subprocess via ThreadPoolExecutor. No xdist,
no custom pytest plugin, no shared worker state.

Key changes:
  * scripts/run_tests_parallel.py — new runner: discovers test files,
    runs N in parallel via ThreadPoolExecutor, captures stdout per file,
    treats exit code 5 (no tests collected) as pass, kills all children
    on exit. Worker count goes from cpu_count to cpu_count*2: the runner is
    I/O-bound (waiting on subprocess.communicate() from pytest children)
    and the parent process does almost no CPU work, so 2x oversubscription
    keeps more pipes full. When a file fails, immediately show the last
    30 lines of pytest output (stack traces + FAILED summary) plus a
    ready-to-copy repro command:
      python -m pytest tests/agent/test_auxiliary_client.py
  * scripts/run_tests.sh — delegates to run_tests_parallel.py
  * .github/workflows/tests.yml — test step: python scripts/run_tests_parallel.py
  * pyproject.toml — drop pytest-xdist, pytest-split; simplify addopts
  * tests/conftest.py — remove ~200 lines of manual state-reset fixtures
  * AGENTS.md — update Testing section for per-file design

* test(runner): speed gateway test antipattern scan up

* fix(test): web search provider plugin test missing xai

* fix(tests): make 14 test files pass under per-file subprocess isolation

Tests that relied on cross-file state pollution from xdist workers
fail when run in isolation (per-file subprocess model). Root causes
and fixes:

Tool registry not populated:
  - test_video_generation_tool_surface_matrix: add discover_builtin_tools()
  - test_web_providers_brave_free/ddgs/searxng/general: autouse fixtures
    registering all 8 bundled web providers, reset after each test
  - test_website_policy: same provider registration pattern
  - test_web_tools_tavily: same pattern across 3 dispatch test classes
  - Also add is_safe_url/check_website_access mocks where SSRF check
    blocks example.com (DNS resolution fails in isolated envs)

Stale check_fn cache:
  - test_kanban_tools: invalidate_check_fn_cache() + _clear_tool_defs_cache()
    in both kanban guidance tests (prior test cached False for kanban_show)
  - test_discord_tool: cache invalidation in setup/teardown
  - test_homeassistant_tool: invalidate_check_fn_cache() before registry queries

Module-level state pollution:
  - test_auxiliary_client: autouse fixture clearing _aux_unhealthy_until cache
  - test_skill_commands: set_session_vars() instead of patch.dict(os.environ)
    (ContextVar takes precedence over os.environ)
  - test_dm_topics: overwrite sys.modules + separate telegram.constants mock
    + force-reimport of gateway.platforms.telegram
  - test_terminal_tool_requirements: removed duplicate class declaration,
    autouse _clear_caches fixture

* change(tests): run_tests.sh explicitly includes env vars

Instead of manually dropping some vars, we now include only a selected set.

* fix(tests): 5 more isolation/NixOS fixes

- test_approval_plugin_hooks: isolate HERMES_HOME so real user's
  command_allowlist doesn't short-circuit the approval path
- test_google_chat: skipif when Platform.GOOGLE_CHAT not in enum
  (feature not merged on this branch)
- test_write_deny: test systemd prefix against tmp_path instead of
  /etc/systemd which resolves to /nix/store on NixOS
- test_pty_bridge: use shutil.which('cat') instead of /bin/cat
  (doesn't exist on NixOS)
- profiles.py: rmtree onexc handler chmod's parent dirs too, fixing
  profile deletion when copytree preserved read-only modes from
  nix store

* fix(tests): clear unhealthy cache in autouse fixture for auxiliary_client

* fix(tests): skip send_message when telegram not installed; handle missing worker_id in browser_supervisor

* fix: py3.11 rmtree onexc compat + belt-and-suspenders unhealthy cache clear for expired codex test

* fix: address PR #29016 review feedback

- Remove tracked .pytest-cache/ artifact and add to .gitignore
- Fix stale 'xdist worker' comment in conftest.py
- Deduplicate web provider registration into tests/tools/conftest.py
  shared helper (register_all_web_providers), replacing 8 copy-pasted
  blocks across 6 test files
- Update PR description: remove stale recovered-test-files claim,
  fix worker count to match code (cpu_count*2)

* fix: eliminate race in stale-cache achievements test

The background scan thread could complete and overwrite _SNAPSHOT_CACHE
before evaluate_all() returned the stale data — only 10 fake sessions
made the scan finish instantly. Added scan_delay param to _FakeSessionDB
and set it to 2s in the stale-cache test so the background thread can't
win the race.
2026-05-21 16:40:04 +05:30
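The per-file isolation design from 48be2e0e4d can be sketched in a few lines: each test file runs in its own `python -m pytest <file>` subprocess, fanned out through a ThreadPoolExecutor with cpu_count*2 workers, and exit code 5 (no tests collected) counts as a pass. The function names and the shape of the failure tuple are hypothetical, and the sketch omits the real runner's child cleanup, timeouts, and progress output.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

NO_TESTS_COLLECTED = 5  # pytest exit code when a file collects no tests


def is_pass(rc):
    """Exit 0 and 'no tests collected' both count as a pass."""
    return rc in (0, NO_TESTS_COLLECTED)


def run_file(path):
    """Run one test file in a fresh interpreter; return (path, ok, output)."""
    proc = subprocess.run(
        [sys.executable, "-m", "pytest", path],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return path, is_pass(proc.returncode), proc.stdout


def run_all(files, run_one=run_file, jobs=None):
    """Run files in parallel; return (path, output tail, repro) per failure."""
    # Threads only wait on child processes, so 2x oversubscription is cheap.
    jobs = jobs or (os.cpu_count() or 1) * 2
    failures = []
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        for path, ok, output in pool.map(run_one, files):
            if not ok:
                tail = "\n".join(output.splitlines()[-30:])  # last 30 lines
                failures.append((path, tail, f"python -m pytest {path}"))
    return failures
```

The `run_one` parameter exists only so the sketch can be exercised without pytest installed; a caller would normally rely on the default `run_file`.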