* test: make test env hermetic; enforce CI parity via scripts/run_tests.sh
Fixes the recurring 'works locally, fails in CI' (and vice versa) class
of flakes by making tests hermetic and providing a canonical local runner
that matches CI's environment.
## Layer 1 — hermetic conftest.py (tests/conftest.py)
Autouse fixture now unsets every credential-shaped env var before every
test, so developer-local API keys can't leak into tests that assert
'auto-detect provider when key present'.
Pattern: unset any var ending in _API_KEY, _TOKEN, _SECRET, _PASSWORD,
_CREDENTIALS, _ACCESS_KEY, _PRIVATE_KEY, etc. Plus an explicit list of
credential names that don't fit the suffix pattern (AWS_ACCESS_KEY_ID,
FAL_KEY, GH_TOKEN, etc.) and all the provider BASE_URL overrides that
change auto-detect behavior.
Also unsets HERMES_* behavioral vars (HERMES_YOLO_MODE, HERMES_QUIET,
HERMES_SESSION_*, etc.) that mutate agent behavior.
Also:
- Redirects HOME to a per-test tempdir (not just HERMES_HOME), so
code reading ~/.hermes/* directly can't touch the real dir.
- Pins TZ=UTC, LANG=C.UTF-8, LC_ALL=C.UTF-8, PYTHONHASHSEED=0 to
match CI's deterministic runtime.
The old _isolate_hermes_home fixture name is preserved as an alias so
any test that yields it explicitly still works.
## Layer 2 — scripts/run_tests.sh canonical runner
'Always use scripts/run_tests.sh, never call pytest directly' is the
new rule (documented in AGENTS.md). The script:
- Unsets all credential env vars (belt-and-suspenders for callers
who bypass conftest — e.g. IDE integrations)
- Pins TZ/LANG/PYTHONHASHSEED
- Uses -n 4 xdist workers (matches GHA ubuntu-latest; -n auto on
a 20-core workstation surfaces test-ordering flakes CI will never
see, causing the infamous 'passes in CI, fails locally' drift)
- Finds the venv in .venv, venv, or main checkout's venv
- Passes through arbitrary pytest args
Installs pytest-split on demand so the script can also be used to run
matrix-split subsets locally for debugging.
## Remove 3 module-level dotenv stubs that broke test isolation
tests/hermes_cli/test_{arcee,xiaomi,api_key}_provider.py each had a
module-level:
if 'dotenv' not in sys.modules:
fake_dotenv = types.ModuleType('dotenv')
fake_dotenv.load_dotenv = lambda *a, **kw: None
sys.modules['dotenv'] = fake_dotenv
This patches sys.modules['dotenv'] to a fake at import time with no
teardown. Under pytest-xdist LoadScheduling, whichever worker collected
one of these files first poisoned its sys.modules; subsequent tests in
the same worker that imported load_dotenv transitively (e.g.
test_env_loader.py via hermes_cli.env_loader) got the no-op lambda and
saw their assertions fail.
dotenv is a required dependency (python-dotenv>=1.2.1 in pyproject.toml),
so the defensive stub was never needed. Removed.
## Validation
- tests/hermes_cli/ alone: 2178 passed, 1 skipped, 0 failed (was 4
failures in test_env_loader.py before this fix)
- tests/test_plugin_skills.py, tests/hermes_cli/test_plugins.py,
tests/test_hermes_logging.py combined: 123 passed (the caplog
regression tests from PR #11453 still pass)
- Local full run shows no F/E clusters in the 0-55% range that were
previously present before the conftest hardening
## Background
See AGENTS.md 'Testing' section for the full list of drift sources
this closes. Matrix split (closed as #11566) will be re-attempted
once this foundation lands — cross-test pollution was the root cause
of the shard-3 hang in that PR.
* fix(conftest): don't redirect HOME — it broke CI subprocesses
PR #11577's autouse fixture was setting HOME to a per-test tempdir.
CI started timing out at 97% complete with dozens of E/F markers and
orphan python processes at cleanup — tests (or transitive deps)
spawn subprocesses that expect a stable HOME, and the redirect broke
them in non-obvious ways.
Env-var unsetting and TZ/LANG/hashseed pinning (the actual CI-drift
fixes) are unchanged and still in place. HERMES_HOME redirection is
also unchanged — that's the canonical way to isolate tests from
~/.hermes/, not HOME.
Any code in the codebase reading ~/.hermes/* via `Path.home() / ".hermes"`
instead of `get_hermes_home()` is a bug to fix at the callsite, not
something to paper over in conftest.
* fix(tests): mock is_safe_url in tests that use example.com
Tests using example.com URLs were failing because is_safe_url does a real DNS lookup which fails in environments where example.com doesn't resolve, causing the request to be blocked before reaching the already-mocked HTTP client. This should fix around 17 failing tests.
These tests test logic, caching, etc. so mocking this method should not modify them in any way. TestMattermostSendUrlAsFile was already doing this so we follow the same pattern.
* fix(test): use case-insensitive lookup for model context length check
DEFAULT_CONTEXT_LENGTHS uses inconsistent casing (MiniMax keys are lowercase, Qwen keys are mixed-case) so the test was broken in some cases since it couldn't find the model.
* fix(test): patch is_linux in systemd gateway restart test
The test only patched is_macos to False but didn't patch is_linux to True. On macOS hosts, is_linux() returns False and the systemd restart code path is skipped entirely, making the assertion fail.
* fix(test): use non-blocklisted env var in docker forward_env tests
GITHUB_TOKEN is in api_key_env_vars and thus in _HERMES_PROVIDER_ENV_BLOCKLIST so the env var is silently dropped, we replace it with a non-blocked one like DATABASE_URL so the tests actually work.
* fix(test): fully isolate _has_any_provider_configured from host env
_has_any_provider_configured() checks all env vars from PROVIDER_REGISTRY (not just the 5 the tests were clearing) and also calls get_auth_status() which detects gh auth token for Copilot. On machines with any of these set, the function returns True before reaching the code path under test.
Clear all registry vars and mock get_auth_status so host credentials don't interfere.
* fix(test): correct path to hermes_base_env.py in tool parser tests
Path(__file__).parent.parent resolved to tests/, not the project root.
The file lives at environments/hermes_base_env.py so we need one more parent level.
* fix(test): accept optional HTML fields in Matrix send payload
_send_matrix sometimes adds format and formatted_body when the markdown library is installed. The test was doing an exact dict equality check which broke. Check required fields instead.
* fix(test): add config.yaml to codex vision requirements test
The test only wrote auth.json but not config.yaml, so _read_main_provider() returned empty and vision auto-detect never tried the codex provider. Add a config.yaml pointing at openai-codex so the fallback path actually resolves the client.
* fix(test): clear OPENROUTER_API_KEY in _isolate_hermes_home
run_agent.py calls load_hermes_dotenv() at import time, which injects API keys from ~/.hermes/.env into os.environ before any test fixture runs. This caused test_agent_loop_tool_calling to make real API calls instead of skipping, which ends up making some tests fail.
* fix(test): add get_rate_limit_state to agent mock in usage report tests
_show_usage now calls agent.get_rate_limit_state() for rate limit
display. The SimpleNamespace mock was missing this method.
* fix(test): update expected Camofox config version from 12 to 13
* fix(test): mock _get_enabled_platforms in nous managed defaults test
Importing gateway.run leaks DISCORD_BOT_TOKEN into os.environ, which makes _get_enabled_platforms() return ["cli", "discord"] instead of just ["cli"]. tools_command loops per platform, so apply_nous_managed_defaults
runs twice: the first call sets config values, the second sees them as
already configured and returns an empty set, causing the assertion to
fail.
* fix: prevent infinite 400 failure loop on context overflow (#1630)
When a gateway session exceeds the model's context window, Anthropic may
return a generic 400 invalid_request_error with just 'Error' as the
message. This bypassed the phrase-based context-length detection,
causing the agent to treat it as a non-retryable client error. Worse,
the failed user message was still persisted to the transcript, making
the session even larger on each attempt — creating an infinite loop.
Three-layer fix:
1. run_agent.py — Fallback heuristic: when a 400 error has a very short
generic message AND the session is large (>40% of context or >80
messages), treat it as a probable context overflow and trigger
compression instead of aborting.
2. run_agent.py + gateway/run.py — Don't persist failed messages:
when the agent returns failed=True before generating any response,
skip writing the user's message to the transcript/DB. This prevents
the session from growing on each failure.
3. gateway/run.py — Smarter error messages: detect context-overflow
failures and suggest /compact or /reset specifically, instead of a
generic 'try again' that will fail identically.
* fix(skills): detect prompt injection patterns and block cache file reads
Adds two security layers to prevent prompt injection via skills hub
cache files (#1558):
1. read_file: blocks direct reads of ~/.hermes/skills/.hub/ directory
(index-cache, catalog files). The 3.5MB clawhub_catalog_v1.json
was the original injection vector — untrusted skill descriptions
in the catalog contained adversarial text that the model executed.
2. skill_view: warns when skills are loaded from outside the trusted
~/.hermes/skills/ directory, and detects common injection patterns
in skill content ("ignore previous instructions", "<system>", etc.).
Cherry-picked from PR #1562 by ygd58.
* fix(tools): chunk long messages in send_message_tool before dispatch (#1552)
Long messages sent via send_message tool or cron delivery silently
failed when exceeding platform limits. Gateway adapters handle this
via truncate_message(), but the standalone senders in send_message_tool
bypassed that entirely.
- Apply truncate_message() chunking in _send_to_platform() before
dispatching to individual platform senders
- Remove naive message[i:i+2000] character split in _send_discord()
in favor of centralized smart splitting
- Attach media files to last chunk only for Telegram
- Add regression tests for chunking and media placement
Cherry-picked from PR #1557 by llbn.
* fix(approval): show full command in dangerous command approval (#1553)
Previously the command was truncated to 80 chars in CLI (with a
[v]iew full option), 500 chars in Discord embeds, and missing entirely
in Telegram/Slack approval messages. Now the full command is always
displayed everywhere:
- CLI: removed 80-char truncation and [v]iew full menu option
- Gateway (TG/Slack): approval_required message includes full command
in a code block
- Discord: embed shows full command up to 4096-char limit
- Windows: skip SIGALRM-based test timeout (Unix-only)
- Updated tests: replaced view-flow tests with direct approval tests
Cherry-picked from PR #1566 by crazywriter1.
---------
Co-authored-by: buray <ygd58@users.noreply.github.com>
Co-authored-by: lbn <llbn@users.noreply.github.com>
Co-authored-by: crazywriter1 <53251494+crazywriter1@users.noreply.github.com>
Add base_url/api_key overrides for auxiliary tasks and delegation so users can
route those flows straight to a custom OpenAI-compatible endpoint without
having to rely on provider=main or named custom providers.
Also clear gateway session env vars in test isolation so the full suite stays
deterministic when run from a messaging-backed agent session.
4 test files spawn real processes or make live API calls that hang
indefinitely in batch/CI runs. Skip them with pytestmark:
- tests/tools/test_code_execution.py (subprocess spawns)
- tests/tools/test_file_tools_live.py (live LocalEnvironment)
- tests/test_413_compression.py (blocks on process)
- tests/test_agent_loop_tool_calling.py (live OpenRouter API calls)
Also added global 30s signal.alarm timeout in conftest.py as a safety
net, and removed stale nous-api test that hung on OAuth browser login.
Suite now runs in ~55s with no hangs.
Added a fixture to redirect HERMES_HOME to a temporary directory during tests, preventing writes to the user's home directory. Updated the test for DebugSession to create a dedicated log directory for saving logs, ensuring test isolation and accuracy in assertions.