hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-07-01 12:02:05 +00:00

Author	SHA1	Message	Date
Teknium	b508d4296e	test(ci): raise per-file timeout 140s → 300s to stop false timeouts (#54143) * test(ci): raise per-file timeout 140s to 300s to stop false timeouts The per-file parallel runner caps each test-file subprocess at a flat wall-clock budget. Combined with per-test subprocess isolation (a fresh Python process per test), a large-collection file pays N x (interpreter startup + import) of overhead before any test logic runs. That overhead dilates under load on shared CI runners, so a file that finishes in ~100s on a quiet box can blow the old 140s cap purely from scheduling jitter, surfacing as a false 'no tests ran' timeout (rc=124) with zero actual test failures. Raise the default to 300s (5 min). The Docker build matrix jobs already take 7-10 min, so this headroom costs nothing on total CI wall time while still bounding a genuinely hung file. * docs: add infographic for CI per-file timeout bump	2026-06-28 02:41:07 -07:00
Teknium	2523917680	fix(tests): bare pytest flags pass through run_tests.sh without a '--' separator (#54008) The parallel runner only forwarded pytest args after a literal '--', so a bare 'scripts/run_tests.sh tests/foo.py -q' (or -v/-x/-k/--tb=long) errored out with 'unrecognized arguments'. This contradicted the docstring's promise that common pytest flags pass through, and forced a retry on every run that used pytest muscle-memory. Now any token starting with '-' that isn't one of the runner's own options (-j/--jobs, --paths, --slice, --file-timeout, --generate-slices, --files, --include-integration) is routed to each per-file pytest invocation automatically. Value-taking flags given space-separated (-k expr, -m mark, -p plugin, -o name=val, etc.) keep their value instead of having it stolen by positional-path discovery. The explicit '--' separator still works and stacks with bare flags. - scripts/run_tests_parallel.py: argv splitter routes bare unknown flags to pytest; value-flag lookahead; updated docstring. - scripts/run_tests.sh: usage comment reflects bare-flag passthrough. - tests/test_run_tests_parallel.py: 4 behavior-contract tests (bare -q runs, -k keeps its value/filters, '--' still works, positional path stays a root).	2026-06-27 22:43:26 -07:00
ethernet	dd0e4ab81a	change(ci): slice files in matrix job avoid duplicating work, avoid file discovery on each job	2026-06-26 19:15:18 -07:00
ethernet	1a75387fa8	change(ci): log json decode error in durations	2026-06-26 19:15:18 -07:00
ethernet	707ae6e623	change(tests): don't count with pytest collect it's way too slow. just grep files lol	2026-06-26 19:15:18 -07:00
ethernet	9a861cd0ab	change(tests): don't pass pytest args when counting tests	2026-06-26 19:15:18 -07:00
ethernet	fb1dd1bf91	change(ci): docker-publish.yml -> docker.yml	2026-06-26 19:15:18 -07:00
ethernet	4d68984ec7	fix(tests): remove no-longer-needed forensics	2026-06-12 13:42:42 -04:00
ethernet	2f9d18711f	fix(ci): remove pytest-timeout, use per-file timeout only; fix(ci): write a new cache for test durations every time; change(ci): rip out error 4 retries because we found the real bug	2026-06-12 13:42:42 -04:00
Teknium	07ac185904	fix(ci): exit-4 forensics for vanishing test files in run_tests_parallel.py (#43646) * fix(ci): append filesystem forensics when a per-file pytest run exhausts exit-4 retries A PR-added test file (tests/test_iron_proxy.py, PR #30179) repeatedly failed exactly one CI shard with 'ERROR: file or directory not found' across 4 runs (including a fresh merge SHA on fresh runners), while the identical slice passes locally against the same merge commit and a tree-integrity watcher confirms no sibling test mutates the repo. Three unrelated branches showed the same one-shard signature the same day. We currently cannot attribute these because the log only carries pytest's exit-4 line. This adds a forensics block to the captured output when exit-4 survives the retry loop: - does the file exist NOW (post-retries) - parent dir entry count + similarly-named entries - git status --porcelain dirty-entry count + first 10 entries Zero behavior change: rc stays 4, retries unchanged, forensics wrapped in a broad try/except so they can never mask the failure. Two new tests cover the exhausted-retries and genuinely-missing paths. * chore: drop the two forensics tests — ship the runner change only	2026-06-10 10:04:17 -07:00
Teknium	f082b4ec5c	fix(ci): make parallel runner's exit-4 retry robust for newly-added test files (#42994) The per-file test runner re-runs a file once when pytest exits 4 ("file or directory not found") while the file exists on disk — a transient seen on loaded shared CI runners where the planner collects a file (--collect-only counts its tests) but the per-file subprocess fails to stat it moments later. A single immediate retry could land in the same brief high-load window and fail again, and the retry was gated on one Path.exists() check that can itself be a flaky stat under that load — so a freshly-added test file that LPT pins to one shard would deterministically red that shard on every run (no actual test failure; the file just never executes). - Extract the subprocess spawn/communicate/process-tree-kill logic into a shared _spawn_pytest_once() helper (removes ~90 lines of duplication between the primary run and the retry). - Replace the single-shot retry with a bounded backoff loop (_EXIT4_RETRY_ATTEMPTS, escalating sleep) that re-runs while the file is present on disk. - Add _file_present() which re-checks existence across a few spaced stats, so a single flaky negative stat doesn't wrongly conclude the file is missing. A genuinely-missing file (typo/deleted) still fails fast — exit 4 is not swallowed when the file truly does not exist. - Tests: transient-then-pass recovery, genuinely-missing fails fast with no retry, give-up after max attempts, and _file_present transient/missing cases.	2026-06-09 21:39:09 -07:00
teknium1	754154a9c2	fix(tests): retry per-file pytest subprocess once on exit-4 when the file exists The parallel test runner sharded a present, tracked test file (tests/plugins/platforms/photon/test_inbound.py) onto a slice that then reported 'file or directory not found' (pytest exit 4) at exec time — even though the planner had just enumerated the file via --collect-only ('5269 passed, 0 failed' in the same run). On loaded shared CI runners the per-file subprocess can fail to stat a file the planner already saw; the deterministic LPT slicer then reproduces it on every rerun because the same file set lands on the same shard. Fix: when a per-file run exits 4 AND the file still exists on disk, retry the subprocess once before surfacing it as a hard failure. This kills the shard-flake class for everyone, not just this PR. Does NOT widen the exit-5-is-pass rule — exit 4 on a genuinely missing file still fails (verified). Retry reuses the same pgroup-kill cleanup as the primary run so no grandchildren orphan. Validation: photon dir runs green through scripts/run_tests_parallel.py; unit-level negative case confirms a nonexistent file still returns rc=4.	2026-06-08 13:38:30 -07:00
Ben	da8b2e95fd	ci(docker): run tests/docker/ in build-amd64 against the freshly-built image The new tests/docker/ suite (added by this PR) was being picked up by the sharded pytest matrix in tests.yml, where its session-scoped `built_image` fixture issued a 3-7min `docker build` under tests/docker/conftest.py's 180s pytest-timeout cap. Every test in the directory failed in fixture setup across all 6 shards. Fix the suite so it actually runs (not skips): 1. Wire the docker tests into docker-publish.yml's build-amd64 job, right after the existing smoke test. The image is already loaded into the local daemon as `nousresearch/hermes-agent:test`; set HERMES_TEST_IMAGE to that and the fixture's pre-built-image branch short-circuits the rebuild. 21 tests run in ~90s locally against a prebuilt image, no rebuild cost on top of the existing build step. 2. Exclude tests/docker/ from scripts/run_tests_parallel.py's default discovery so the sharded matrix in tests.yml stops trying to build the image. Explicit positional paths (`pytest tests/docker/` or `scripts/run_tests.sh tests/docker/`) still pick the suite up — the skip rule honors directory-level user intent, matching the existing per-file override pattern. The dedicated docker-tests step runs on every PR that touches docker code (the existing path filters on docker-publish.yml already cover `tests/docker/` via `/*.py`), so the suite gates real changes. (cherry picked from commit `4c481860ce`)	2026-05-25 12:40:57 +10:00
teknium1	5cbb132c1d	fix(ci): exclude tests/docker/ from regular test shards; pin read_text encoding Two CI follow-ups to @benbarclay's #30136 salvage: 1. scripts/run_tests_parallel.py — add 'docker' to _SKIP_PARTS so the new tests/docker/ harness doesn't run in the regular test (N) matrix. The harness builds the real Dockerfile in a session fixture, which can exceed pytest-timeout's 180s ceiling on ubuntu-latest where Docker IS available — it surfaced as 6 identical setup-timeout failures across slices 1–6 on the first CI run. The docker harness has its own dedicated runner via .github/actions/hermes-smoke-test (added in #30136) plus the docker-lint workflow. Same treatment as tests/integration/ and tests/e2e/ — runs separately, not in the main shards. 2. hermes_cli/service_manager.py — pin encoding='utf-8' on the /proc/1/comm read_text call. Ruff PLW1514 enforcement rolled in between Ben's last push and the salvage; pure ruff-fix, no behavior change.	2026-05-24 18:23:13 -07:00
ethernet	b689624aee	feat(ci): 4-way matrix slicing with LPT duration-balanced distribution run_tests_parallel.py: - --slice I/N flag (also HERMES_TEST_SLICE env var) runs only the I-th slice of N, distributing files across slices by cached duration using LPT (Longest Processing Time first) greedy algorithm so each slice gets roughly equal wall time - Duration cache (test_durations.json): maps relative file paths to last-observed subprocess wall time. _save_durations merges with existing cache so entries from other slices are preserved. - Per-file subprocess timing in progress output + end-of-run distribution summary (percentiles, top-10 slowest, <1s/<2s counts) - Unknown files default to 2.0s estimate (~P50), spread evenly by LPT .github/workflows/tests.yml: - Matrix strategy: slice [1, 2, 3, 4] with fail-fast: false - Each slice restores duration cache from main (stable key, no SHA), runs its portion, uploads per-slice durations as artifacts - save-durations job (main only, if: always()) downloads all 4 artifacts, merges into single cache entry for future PRs - Timeout reduced from 60min to 30min per slice (~1/4 the work) Cache design: - Stable key (test-durations) not keyed by commit SHA — durations are about files, not commits, and SHA-keyed caches miss on every new commit and on PR merge commits - actions/cache scoping: main's cache is visible to all PRs targeting main; feature branches without a cache still work (default 2.0s) - No dotfile prefix (upload-artifact v7 skips hidden files)	2026-05-22 19:46:18 -07:00
ethernet	48be2e0e4d	test: use subprocesses for each test file (#29016) * ci(tests): install ripgrep from prebuilt tarball instead of apt apt-get update + install of ripgrep takes ~4 min on the GHA Ubuntu runners (the apt-get update against archive.ubuntu.com is the slow part; ripgrep itself is small). Switching to the upstream musl binary tarball cuts the step to a few seconds. - Pinned to ripgrep 15.1.0 with sha256 verification (same hash as published in the releases sha256 sidecar file). - Drops the `rg` binary into /usr/local/bin so it is on PATH for every subsequent step without GITHUB_PATH manipulation. - Applied to both the test and e2e jobs in tests.yml. * fix(cli): compile syntax check to tempdir, not source __pycache__ `_validate_critical_files_syntax` runs `py_compile.compile()` on each critical bootstrap file after a successful `git pull`. The default `py_compile` writes the resulting `.pyc` next to the source under `__pycache__/`, which causes two real problems: 1. Parallel test workers walking the same source tree (e.g. running the suite under per-file process isolation) can race against each other on the `__pycache__` write — manifests as flaky 'directory not empty' errors during teardown. 2. In production, the post-pull syntax check leaves a `.pyc` behind that the next interpreter run might pick up — fine when the interpreter version matches, sketchy if it doesn't. Fix: write the compiled output to a `tempfile.TemporaryDirectory()` that's discarded on function exit. We only care about the compile-or-not signal, not the artifact. * test(runner): per-file process isolation, drop manual state reset + xdist Replace fragile manual _reset_module_state test fixtures with robust per-file subprocess isolation. Each test file runs in a fresh `python -m pytest <file>` subprocess via ThreadPoolExecutor. No xdist, no custom pytest plugin, no shared worker state.
Key changes: * scripts/run_tests_parallel.py — new runner: discovers test files, runs N in parallel via ThreadPoolExecutor, captures stdout per file, treats exit code 5 (no tests collected) as pass, kills all children on exit. Change from cpu_count to cpu_count*2. The runner is I/O-bound (waiting on subprocess.communicate() from pytest children). The parent process does almost no CPU work, so 2x oversubscription keeps more pipes full. When a file fails, immediately show the last 30 lines of pytest output (stack traces + FAILED summary) plus a ready-to-copy repro command: python -m pytest tests/agent/test_auxiliary_client.py * scripts/run_tests.sh — delegates to run_tests_parallel.py * .github/workflows/tests.yml — test step: python scripts/run_tests_parallel.py * pyproject.toml — drop pytest-xdist, pytest-split; simplify addopts * tests/conftest.py — remove ~200 lines of manual state-reset fixtures * AGENTS.md — update Testing section for per-file design * test(runner): speed gateway test antipattern scan up * fix(test): web search provider plugin test missing xai * fix(tests): make 14 test files pass under per-file subprocess isolation Tests that relied on cross-file state pollution from xdist workers fail when run in isolation (per-file subprocess model).
Root causes and fixes: Tool registry not populated: - test_video_generation_tool_surface_matrix: add discover_builtin_tools() - test_web_providers_brave_free/ddgs/searxng/general: autouse fixtures registering all 8 bundled web providers, reset after each test - test_website_policy: same provider registration pattern - test_web_tools_tavily: same pattern across 3 dispatch test classes - Also add is_safe_url/check_website_access mocks where SSRF check blocks example.com (DNS resolution fails in isolated envs) Stale check_fn cache: - test_kanban_tools: invalidate_check_fn_cache() + _clear_tool_defs_cache() in both kanban guidance tests (prior test cached False for kanban_show) - test_discord_tool: cache invalidation in setup/teardown - test_homeassistant_tool: invalidate_check_fn_cache() before registry queries Module-level state pollution: - test_auxiliary_client: autouse fixture clearing _aux_unhealthy_until cache - test_skill_commands: set_session_vars() instead of patch.dict(os.environ) (ContextVar takes precedence over os.environ) - test_dm_topics: overwrite sys.modules + separate telegram.constants mock + force-reimport of gateway.platforms.telegram - test_terminal_tool_requirements: removed duplicate class declaration, autouse _clear_caches fixture * change(tests): run_tests.sh explicitly includes env vars instead of manually dropping some vars, now we just only include some * fix(tests): 5 more isolation/NixOS fixes - test_approval_plugin_hooks: isolate HERMES_HOME so real user's command_allowlist doesn't short-circuit the approval path - test_google_chat: skipif when Platform.GOOGLE_CHAT not in enum (feature not merged on this branch) - test_write_deny: test systemd prefix against tmp_path instead of /etc/systemd which resolves to /nix/store on NixOS - test_pty_bridge: use shutil.which('cat') instead of /bin/cat (doesn't exist on NixOS) - profiles.py: rmtree onexc handler chmod's parent dirs too, fixing profile deletion when copytree preserved read-only modes from nix store * fix(tests): clear unhealthy cache in autouse fixture for auxiliary_client * fix(tests): skip send_message when telegram not installed; handle missing worker_id in browser_supervisor * fix: py3.11 rmtree onexc compat + belt-and-suspenders unhealthy cache clear for expired codex test * fix: address PR #29016 review feedback - Remove tracked .pytest-cache/ artifact and add to .gitignore - Fix stale 'xdist worker' comment in conftest.py - Deduplicate web provider registration into tests/tools/conftest.py shared helper (register_all_web_providers), replacing 8 copy-pasted blocks across 6 test files - Update PR description: remove stale recovered-test-files claim, fix worker count to match code (cpu_count*2) fix: eliminate race in stale-cache achievements test The background scan thread could complete and overwrite _SNAPSHOT_CACHE before evaluate_all() returned the stale data — only 10 fake sessions made the scan finish instantly. Added scan_delay param to _FakeSessionDB and set it to 2s in the stale-cache test so the background thread can't win the race.	2026-05-21 16:40:04 +05:30
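
Note on commit b689624aee: the LPT (Longest Processing Time first) slicing it describes can be illustrated with a minimal sketch. This is not the code from scripts/run_tests_parallel.py. The names lpt_slices and pick_slice are hypothetical, and the alphabetical tie-break is an assumption added to keep the plan deterministic. Only the greedy LPT idea, the 2.0s default for files missing from the duration cache, and the 1-indexed I/N slice form come from the commit message.

```python
import heapq

# From the commit message: files with no cached duration are estimated at 2.0s.
DEFAULT_DURATION = 2.0


def lpt_slices(files, durations, n_slices, default=DEFAULT_DURATION):
    """Distribute files over n_slices lists, balancing estimated wall time.

    Greedy LPT: take files longest-first and always hand the next file to
    the slice whose running total is currently the smallest.
    """
    # Longest first; tie-break on the path (an assumption) for determinism.
    ordered = sorted(files, key=lambda f: (-durations.get(f, default), f))
    # Min-heap of (total_seconds, slice_index); the lightest slice is on top.
    heap = [(0.0, i) for i in range(n_slices)]
    heapq.heapify(heap)
    slices = [[] for _ in range(n_slices)]
    for f in ordered:
        total, idx = heapq.heappop(heap)
        slices[idx].append(f)
        heapq.heappush(heap, (total + durations.get(f, default), idx))
    return slices


def pick_slice(files, durations, spec):
    """Return the I-th of N slices for a spec such as '2/4' (1-indexed)."""
    i, n = (int(part) for part in spec.split("/"))
    if not 1 <= i <= n:
        raise ValueError(f"slice {spec!r} is out of range")
    return lpt_slices(files, durations, n)[i - 1]


if __name__ == "__main__":
    cached = {"a": 10.0, "b": 8.0, "c": 3.0, "d": 2.0}  # "e" has no entry
    print(lpt_slices(["a", "b", "c", "d", "e"], cached, 2))
```

Because every slice computes the same plan from the same cache, each matrix job can pick its own portion without coordinating with the others. That same determinism is why, as the later commits note, a problematic file lands on the same shard on every rerun.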

16 commits
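
Note on commit 2523917680: the bare-flag passthrough it describes can be sketched as a small argv splitter. This is an illustration, not the actual splitter in scripts/run_tests_parallel.py. The function name split_argv is hypothetical. The option names are taken from the commit message, but which runner options take a value is an assumption, as is the exact set of value-taking pytest flags.

```python
# Runner options named in the commit message. Which of them consume a
# following value is an ASSUMPTION made for this sketch.
RUNNER_VALUE_OPTS = {"-j", "--jobs", "--paths", "--slice", "--file-timeout", "--files"}
RUNNER_FLAG_OPTS = {"--generate-slices", "--include-integration"}

# pytest flags that take a space-separated value (examples from the commit).
PYTEST_VALUE_FLAGS = {"-k", "-m", "-p", "-o"}


def split_argv(argv):
    """Split argv into (runner_args, pytest_args).

    Runner options and positional paths stay with the runner; any other
    token starting with '-' is routed to pytest, and everything after a
    literal '--' goes to pytest unchanged. Attached short forms such as
    '-j4' are not handled here.
    """
    runner_args, pytest_args = [], []
    i = 0
    while i < len(argv):
        tok = argv[i]
        if tok == "--":
            # Explicit separator: the rest belongs to pytest.
            pytest_args.extend(argv[i + 1:])
            break
        name = tok.split("=", 1)[0]
        if name in RUNNER_VALUE_OPTS or name in RUNNER_FLAG_OPTS:
            runner_args.append(tok)
            # Space-separated form: the next token is this option's value.
            if name in RUNNER_VALUE_OPTS and "=" not in tok and i + 1 < len(argv):
                runner_args.append(argv[i + 1])
                i += 1
        elif tok.startswith("-"):
            pytest_args.append(tok)
            # Lookahead so '-k expr' keeps 'expr' instead of losing it to
            # positional-path discovery.
            if tok in PYTEST_VALUE_FLAGS and i + 1 < len(argv):
                pytest_args.append(argv[i + 1])
                i += 1
        else:
            runner_args.append(tok)  # positional path (a discovery root)
        i += 1
    return runner_args, pytest_args


if __name__ == "__main__":
    print(split_argv(["tests/foo.py", "-q"]))
```

The value-flag lookahead is the part that matters most: without it, the expression after -k would be treated as a test path.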