mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-05-30 06:41:51 +00:00
4 commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
48be2e0e4d
|
test: use subprocesses for each test file (#29016)
* ci(tests): install ripgrep from prebuilt tarball instead of apt
apt-get update + install of ripgrep takes ~4 min on the GHA Ubuntu
runners (the apt-get update against archive.ubuntu.com is the slow
part; ripgrep itself is small). Switching to the upstream musl
binary tarball cuts the step to a few seconds.
- Pinned to ripgrep 15.1.0 with sha256 verification (same hash as
published in the releases sha256 sidecar file).
- Drops the `rg` binary into /usr/local/bin so it is on PATH for
every subsequent step without GITHUB_PATH manipulation.
- Applied to both the test and e2e jobs in tests.yml.
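The sha256 pinning described above can be illustrated in Python. This is only a sketch of the check itself, not the workflow step from tests.yml (which is not shown here); the function name `sha256_matches` and its parameters are hypothetical.

```python
import hashlib


def sha256_matches(path: str, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at `path` hashes to `expected_hex` (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so a large tarball is never held in memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()
```

A caller would refuse to unpack the tarball unless `sha256_matches(tarball, pinned_hash)` returns True.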
* fix(cli): compile syntax check to tempdir, not source __pycache__
`_validate_critical_files_syntax` runs `py_compile.compile()` on each
critical bootstrap file after a successful `git pull`. The default
`py_compile` writes the resulting `.pyc` next to the source under
`__pycache__/`, which causes two real problems:
1. Parallel test workers walking the same source tree (e.g. running
the suite under per-file process isolation) can race against each
other on the `__pycache__` write — manifests as flaky 'directory
not empty' errors during teardown.
2. In production, the post-pull syntax check leaves a `.pyc` behind
that the next interpreter run might pick up — fine when the
interpreter version matches, sketchy if it doesn't.
Fix: write the compiled output to a `tempfile.TemporaryDirectory()`
that's discarded on function exit. We only care about the compile-or-not
signal, not the artifact.
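A minimal sketch of the approach described above, assuming a standalone helper. The name `syntax_ok` is hypothetical and this is not the body of `_validate_critical_files_syntax`; it only shows `py_compile.compile()` being pointed at a throwaway directory via its `cfile` argument.

```python
import py_compile
import tempfile
from pathlib import Path


def syntax_ok(source_path: str) -> bool:
    """Return True if the file compiles; never write into the source __pycache__."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical output name; the directory is deleted on exit either way.
        target = Path(tmp) / "check.pyc"
        try:
            py_compile.compile(source_path, cfile=str(target), doraise=True)
        except py_compile.PyCompileError:
            return False
    return True
```

With `doraise=True` a syntax error surfaces as `py_compile.PyCompileError` instead of being printed to stderr, so the caller gets a plain boolean.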
* test(runner): per-file process isolation, drop manual state reset + xdist
Replace fragile manual _reset_module_state test fixtures with robust
per-file subprocess isolation. Each test file runs in a fresh
`python -m pytest <file>` subprocess via ThreadPoolExecutor. No xdist,
no custom pytest plugin, no shared worker state.
Key changes:
* scripts/run_tests_parallel.py — new runner: discovers test files,
runs N in parallel via ThreadPoolExecutor, captures stdout per file,
treats exit code 5 (no tests collected) as pass, kills all children
on exit. Worker count changes from cpu_count to cpu_count*2: the runner is
I/O-bound (waiting on subprocess.communicate() from pytest children) and
the parent process does almost no CPU work, so 2x oversubscription
keeps more pipes full. When a file fails, it immediately shows the last
30 lines of pytest output (stack traces + FAILED summary) plus a
ready-to-copy repro command:
python -m pytest tests/agent/test_auxiliary_client.py
* scripts/run_tests.sh — delegates to run_tests_parallel.py
* .github/workflows/tests.yml — test step: python
scripts/run_tests_parallel.py
* pyproject.toml — drop pytest-xdist, pytest-split; simplify addopts
* tests/conftest.py — remove ~200 lines of manual state-reset fixtures
* AGENTS.md — update Testing section for per-file design
* test(runner): speed gateway test antipattern scan up
* fix(test): web search provider plugin test missing xai
* fix(tests): make 14 test files pass under per-file subprocess isolation
Tests that relied on cross-file state pollution from xdist workers
fail when run in isolation (per-file subprocess model). Root causes
and fixes:
Tool registry not populated:
- test_video_generation_tool_surface_matrix: add discover_builtin_tools()
- test_web_providers_brave_free/ddgs/searxng/general: autouse fixtures
registering all 8 bundled web providers, reset after each test
- test_website_policy: same provider registration pattern
- test_web_tools_tavily: same pattern across 3 dispatch test classes
- Also add is_safe_url/check_website_access mocks where SSRF check
blocks example.com (DNS resolution fails in isolated envs)
Stale check_fn cache:
- test_kanban_tools: invalidate_check_fn_cache() + _clear_tool_defs_cache()
in both kanban guidance tests (prior test cached False for kanban_show)
- test_discord_tool: cache invalidation in setup/teardown
- test_homeassistant_tool: invalidate_check_fn_cache() before registry queries
Module-level state pollution:
- test_auxiliary_client: autouse fixture clearing _aux_unhealthy_until cache
- test_skill_commands: set_session_vars() instead of patch.dict(os.environ)
(ContextVar takes precedence over os.environ)
- test_dm_topics: overwrite sys.modules + separate telegram.constants mock
+ force-reimport of gateway.platforms.telegram
- test_terminal_tool_requirements: removed duplicate class declaration,
autouse _clear_caches fixture
* change(tests): run_tests.sh explicitly includes env vars
Instead of manually dropping selected vars, the script now passes through only an explicit list.
* fix(tests): 5 more isolation/NixOS fixes
- test_approval_plugin_hooks: isolate HERMES_HOME so real user's
command_allowlist doesn't short-circuit the approval path
- test_google_chat: skipif when Platform.GOOGLE_CHAT not in enum
(feature not merged on this branch)
- test_write_deny: test systemd prefix against tmp_path instead of
/etc/systemd which resolves to /nix/store on NixOS
- test_pty_bridge: use shutil.which('cat') instead of /bin/cat
(doesn't exist on NixOS)
- profiles.py: rmtree onexc handler also chmods parent dirs, fixing
profile deletion when copytree preserved read-only modes from
nix store
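The profiles.py fix in the last item can be sketched as below. This is an assumption-laden illustration rather than the code in profiles.py: the names `force_rmtree` and `_make_writable_and_retry` are hypothetical. It relies on `shutil.rmtree` accepting `onexc` from Python 3.12 and `onerror` before that; both handlers receive the failing function and path as their first two arguments.

```python
import os
import shutil
import stat
import sys


def _make_writable_and_retry(func, path, _exc):
    """Grant owner rwx on the parent dir and the entry, then retry the failed call."""
    parent = os.path.dirname(path)
    os.chmod(parent, os.stat(parent).st_mode | stat.S_IRWXU)
    if not os.path.islink(path):
        os.chmod(path, os.stat(path).st_mode | stat.S_IRWXU)
    func(path)


def force_rmtree(path: str) -> None:
    """Remove a tree even if a copy preserved read-only modes on its directories."""
    if sys.version_info >= (3, 12):
        shutil.rmtree(path, onexc=_make_writable_and_retry)
    else:
        # Python 3.11 and earlier only know the older `onerror` hook.
        shutil.rmtree(path, onerror=_make_writable_and_retry)
```

Fixing the parent matters because deleting an entry needs write permission on the directory that contains it, not on the entry itself.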
* fix(tests): clear unhealthy cache in autouse fixture for auxiliary_client
* fix(tests): skip send_message when telegram not installed; handle missing worker_id in browser_supervisor
* fix: py3.11 rmtree onexc compat + belt-and-suspenders unhealthy cache clear for expired codex test
* fix: address PR #29016 review feedback
- Remove tracked .pytest-cache/ artifact and add to .gitignore
- Fix stale 'xdist worker' comment in conftest.py
- Deduplicate web provider registration into tests/tools/conftest.py
shared helper (register_all_web_providers), replacing 8 copy-pasted
blocks across 6 test files
- Update PR description: remove stale recovered-test-files claim,
fix worker count to match code (cpu_count*2)
* fix: eliminate race in stale-cache achievements test
The background scan thread could complete and overwrite _SNAPSHOT_CACHE
before evaluate_all() returned the stale data — only 10 fake sessions
made the scan finish instantly. Added scan_delay param to _FakeSessionDB
and set it to 2s in the stale-cache test so the background thread can't
win the race.
|
||
|
|
5fba236644
|
chore: ruff auto-fix PLR6201 resweep — tuple → set in membership tests (#27355)
Six days after #23937 (608 fixes) the codebase had accumulated 241 new PLR6201 violations. Same mechanical `x in (...)` → `x in {...}` fix, same zero-risk profile: set lookup is O(1) vs O(n) for tuple and the two are semantically equivalent for hashable scalar membership tests.
All 241 instances fixed via `ruff check --select PLR6201 --fix --unsafe-fixes`, zero remaining. Every changed value is a hashable scalar (str/int/None/enum/signal); no risk of unhashable runtime errors. No behavior change.
Test plan:
- 119 files changed, +244/-244 (net zero): exactly one-line edits
- `ruff check` clean afterward
- Compile checks pass on the largest touched files (cli.py, run_agent.py, gateway/run.py, gateway/platforms/discord.py, model_tools.py)
- Subset broad test run on tests/gateway/ tests/hermes_cli/ tests/agent/ tests/tools/: 18187 passed, 59 pre-existing failures (verified against origin/main with the same shape, identical failure count, identical category; all xdist test-order flakes unrelated to this change)
Follows the same template as PR #23937 ([tracker: #23972](https://github.com/NousResearch/hermes-agent/issues/23972)).
|
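The hashability caveat in the PLR6201 commit above can be checked with a small example. The state names and helper functions here are made up for illustration and do not come from the repository.

```python
# Hypothetical membership targets: same values, tuple form vs set form.
TERMINAL_STATES_TUPLE = ("done", "failed", "cancelled")
TERMINAL_STATES_SET = {"done", "failed", "cancelled"}


def same_membership(value) -> bool:
    """True when tuple and set membership agree for a hashable value."""
    return (value in TERMINAL_STATES_TUPLE) == (value in TERMINAL_STATES_SET)


def set_membership_raises(value) -> bool:
    """True when testing `value` against the set raises TypeError (unhashable)."""
    try:
        _ = value in TERMINAL_STATES_SET
    except TypeError:
        return True
    return False
```

For strings, ints, None and enum members the two forms agree. A list on the left-hand side is simply "not in" the tuple but raises against the set, which is why the rewrite is only safe when every tested value is hashable.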
||
|
|
fd0796947f
|
fix: stabilize CI — TS widen, sys.modules restore, WS subscriber race (#17836)
Three narrow fixes targeting the remaining red checks after #17828:
1. ui-tui/src/app/slash/commands/ops.ts (Docker Build): /reload-mcp's local params type annotated `session_id: string` while `ctx.sid` is `string | null`. Widen to `string | null`, which matches every other rpc call site and the test harness, which passes `{ session_id: null }`. Fixes TS2322 on line 86. The rpc signature itself is `Record<string, unknown>`, so this is purely a local typing fix, no behavioral change.
2. tests/plugins/test_achievements_plugin.py (13 cascading test failures): `_install_fake_session_db` did a raw `sys.modules['hermes_state'] = fake_module` without restoration, leaking the fake across xdist worker boundaries. Downstream tests doing `from hermes_state import SessionDB` got a module whose SessionDB was `lambda: fake_db`:
   - 6 test_hermes_state.py tests failed with `AttributeError: 'function' object has no attribute '_sanitize_fts5_query'` / `_contains_cjk`
   - 7 test_860_dedup.py tests failed with `TypeError: got unexpected keyword argument 'db_path'` (real code calls `SessionDB(db_path=...)`)
   Fix: stash monkeypatch on the plugin_api module object in the fixture, and have the helper do `monkeypatch.setitem(sys.modules, 'hermes_state', fake_module)` for auto-restoration at test teardown.
3. tests/hermes_cli/test_web_server.py (WS race): `TestPtyWebSocket::test_pub_broadcasts_to_events_subscribers` hit the 30s test timeout on CI. `websocket_connect` returns after `ws.accept()`, but /api/events registers the subscriber in `_event_channels` on the NEXT await (inside `_event_lock`). A publish immediately after connect could race ahead of registration and be dropped, and the subsequent `receive_text()` blocked until SIGALRM killed the test. Fix: poll `_event_channels` after the subscriber connects, before publishing.
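The polling fix in item 3 follows a common wait-for-condition pattern, sketched here with the standard library only. The helper name `wait_until` and its defaults are hypothetical; the actual test polls `_event_channels` in whatever way the test file chooses.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool], timeout: float = 5.0, interval: float = 0.01
) -> bool:
    """Poll `predicate` until it is true or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    # One last check so a condition met right at the deadline still counts.
    return predicate()
```

In a test, the publish step would run only after something like `assert wait_until(lambda: bool(channels))` succeeds, so the message cannot be sent before the subscriber is registered.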
Validation:
- scripts/run_tests.sh tests/plugins/test_achievements_plugin.py tests/run_agent/test_860_dedup.py tests/test_hermes_state.py tests/hermes_cli/test_web_server.py: 338 passed
- cd ui-tui && npm run type-check: clean
- cd ui-tui && npm run build: clean
Remaining red checks are pure infra (Nix ubuntu hits TwirpErrorResponse ResourceExhausted on the GH Actions cache API; Nix macos bounces between npm build openssl-legacy and cache rate-limits) and cannot be fixed in the codebase.
|
||
|
|
62a5d7207d
|
feat(plugins): bundle hermes-achievements + scan full session history (#17754)
* feat(plugins): bundle hermes-achievements, scan full session history
Ships @PCinkusz's hermes-achievements dashboard plugin (https://github.com/PCinkusz/hermes-achievements) as a bundled plugin at plugins/hermes-achievements/ and fixes a bug in the scan path that made the plugin only see the first 200 sessions, making lifetime badges (50k tool calls, 75k errors, etc.) unreachable on long-running installs.
Changes:
- plugins/hermes-achievements/: vendor v0.3.1 verbatim (manifest, dist/, plugin_api.py, tests, docs, README).
- plugins/hermes-achievements/dashboard/plugin_api.py:
  * scan_sessions(): limit=None now scans ALL sessions via SQLite LIMIT -1. Previously capped at 200, so users with 8000+ sessions saw ~2% of their history.
  * evaluate_all(): first-ever scans run in a background thread so the dashboard request path never blocks. Stale snapshots serve immediately while a background refresh runs. force=True still blocks synchronously for manual /rescan.
  * _build_pending_snapshot(), _start_background_scan(), _run_scan_and_update_cache(): supporting plumbing + idempotent thread spawn.
- tests/plugins/test_achievements_plugin.py: new tests covering the 200-cap regression, the background-scan first-run flow, stale-serve-plus-background-refresh, forced sync rescan, and scan-thread idempotency.
- website/docs/user-guide/features/built-in-plugins.md: lists hermes-achievements in the bundled-plugins table and documents API endpoints, state files, and performance characteristics.
E2E validated against a real 8564-session ~6.4GB state.db:
* Cold scan: 13m 19s (one-time, backgrounded; UI never blocks)
* Warm rescan: 1.47s (8563/8564 sessions reused from checkpoint cache)
* 57/60 achievements unlocked, 3 discovered. Aggregates like total_tool_calls=259958, total_errors=164213, skill_events=368243 correctly surface lifetime badges that the 200-cap made unreachable.
Original credit: @PCinkusz (MIT-licensed).
Upstream repo remains the staging ground for new badges; this bundle keeps the dashboard feature parity with Hermes core changes.
* feat(achievements): publish partial snapshots during cold scan
Previously a cold scan on a large session DB (13min on 8564 sessions) showed zero badges for the entire duration, then every badge at once when the scan completed. A dashboard refresh mid-scan was indistinguishable from a fresh install with no history.
Now the scanner publishes a partial snapshot to _SNAPSHOT_CACHE every 250 sessions, so each refresh during a cold scan surfaces more badges incrementally.
Mechanism:
- scan_sessions() takes an optional progress_callback fired every progress_every sessions with (sessions_so_far, scanned, total).
- _compute_from_scan() is extracted from compute_all() and gains an is_partial flag that skips writing to state.json: we don't want to record unlocked_at based on a half-complete aggregate that a later session might rebalance.
- _run_scan_and_update_cache() installs a publisher callback that builds a partial snapshot, marks it mode='in_progress', and writes it to the cache with age=0 so the UI keeps polling /scan-status and picks up the final snapshot when the scan completes.
- Manual /rescan (force=True) disables partial publishing, since the caller is blocking on the final result anyway.
E2E against real 8564-session state.db (polled cache every 10s):
t=10s: cache empty
t=20s: 250/8564 scanned, 35 unlocked, 25 discovered
t=40s: 500/8564 scanned, 42 unlocked, 18 discovered
t=60s: 1000/8564 scanned, 49 unlocked, 11 discovered
...
Tests: 9/9 pass (2 new: partial snapshot publication + no-persist-on-partial). Upstream unittest suite: 10/10 pass.
* feat(achievements): in-progress scan banner with live % progress
Previously the dashboard showed zero badges silently during long cold scans (13min on 8564 sessions).
The backend was publishing partial snapshots every 250 sessions, but the bundled UI didn't surface any indicator that a scan was running. It just rendered the main page with whatever counts were currently published and no way for the user to know more progress was coming.
UI changes (dist/index.js, dist/style.css):
- Added a scan-in-progress banner rendered between the hero and stats when scan_meta.mode is 'pending' or 'in_progress'. Shows:
    BUILDING ACHIEVEMENT PROFILE…
    Scanned 1,750 of 8,564 sessions · 20%.
    Badges unlock as more history streams in.
  with a pulsing teal indicator and a filling teal/cyan progress bar. Disappears the moment the backend flips to 'full' or 'incremental'.
- Added an auto-poller via useEffect: while scanInFlight is true the page re-fetches /achievements every 4s WITHOUT toggling the loading skeleton, so unlock counts tick up visibly without the user refreshing. The effect cleans itself up when the scan finishes.
- Added refresh() (re-fetch, no loading flip) alongside the existing load() (full reload, used by the Rescan button).
Attribution preserved:
- Added a header comment to index.js crediting @PCinkusz (https://github.com/PCinkusz/hermes-achievements, MIT) as the original author, noting the banner is a layered addition on top of the original dist bundle.
- Matching header comment in style.css, flagging the new .ha-scan-banner* rules as the local addition.
Live-verified end to end:
- Spun up `hermes dashboard --port 9229 --no-open` against a fresh HERMES_HOME symlinked to the real 8564-session state.db.
- Opened /achievements in a browser, confirmed the banner renders with live progress: 'Scanned 1,000 of 8,564 sessions · 11%' → updates to '1,250 ... · 14%' → '1,750 ... · 20%' without user interaction, matching the backend's partial publications.
- Stats row simultaneously climbed from 35 → 49 → 53 unlocked as more history streamed in.
- Vision analysis of the rendered page confirms the banner styling matches the rest of the dashboard (dark card bg, teal accent, same small-caps typography, pulsing indicator reusing ha-pulse keyframes). |
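The partial-snapshot mechanism described in the achievements commit above (a progress callback fired every `progress_every` sessions, feeding an in-progress cache entry) can be sketched as follows. This is a simplified illustration, not the code in plugin_api.py: the session objects, the cache layout, and the helpers `make_publisher` and `percent` are assumptions, and the percentage uses integer floor division, which agrees with the banner figures quoted above.

```python
from typing import Callable, Iterable, Optional

ProgressCallback = Callable[[list, int, int], None]


def percent(scanned: int, total: int) -> int:
    """Whole-number progress, rounded down (hypothetical helper)."""
    return 100 * scanned // total if total else 0


def scan_sessions(
    sessions: Iterable,
    progress_callback: Optional[ProgressCallback] = None,
    progress_every: int = 250,
) -> list:
    """Walk all sessions, reporting (sessions_so_far, scanned, total) periodically."""
    pending = list(sessions)
    total = len(pending)
    so_far: list = []
    for scanned, session in enumerate(pending, start=1):
        so_far.append(session)
        if progress_callback and scanned % progress_every == 0:
            progress_callback(so_far, scanned, total)
    return so_far


def make_publisher(cache: dict) -> ProgressCallback:
    """Build a callback that writes an in-progress snapshot into `cache`."""
    def publish(so_far: list, scanned: int, total: int) -> None:
        cache["snapshot"] = {
            "mode": "in_progress",
            "scanned": scanned,
            "total": total,
            "percent": percent(scanned, total),
        }
    return publish
```

A forced rescan would call `scan_sessions(...)` with no callback, which matches the note that partial publishing is disabled when the caller blocks on the final result.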