test(ci): harden two flaky tests against CI noise (#33675)

Two unrelated transient failures hit PR #33661's initial CI run. Both are
pre-existing on main and both passed on rerun. Hardening:

1. tests/cron/test_scheduler.py::TestRunJobConfigLogging: added mocks for
   resolve_runtime_provider() and discover_mcp_tools(). The yaml-warning
   tests are meant to exercise only the warning-log path, but
   _run_job_impl continues into provider resolution and MCP discovery
   after the warning. Both steps can spawn subprocesses or hit the
   network, which pushed the test over its 30s budget under GHA load.

2. tests/tools/test_browser_supervisor.py: guarded Chrome teardown
   against the stdlib subprocess.Popen._wait() race (bpo-38630). When
   SIGCHLD arrives during proc.wait(), _try_wait(WNOHANG) can return a
   foreign pid and the 'assert pid == self.pid or pid == 0' fires. The
   fixture now catches AssertionError/TimeoutExpired, force-kills, and
   always reaps so no zombie process is left behind. The same hardening
   is applied to the early-skip branch.
Teknium 2026-05-27 23:15:41 -07:00 committed by GitHub
parent 875d930ac7
commit 4e702fe2d9
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
2 changed files with 47 additions and 5 deletions

--- a/tests/cron/test_scheduler.py
+++ b/tests/cron/test_scheduler.py
@@ -1450,9 +1450,19 @@ class TestRunJobConfigLogging:
             "prompt": "hello",
         }

+        # Mock heavy post-yaml work so the test only exercises the warning
+        # path. Without these mocks, _run_job_impl continues into provider
+        # resolution and MCP discovery, both of which can spawn subprocesses
+        # / hit the network and have caused this test to time out on CI
+        # (>30s wall clock) under load. See PR #33661 follow-up.
         with patch("cron.scheduler._hermes_home", tmp_path), \
              patch("cron.scheduler._resolve_origin", return_value=None), \
              patch("dotenv.load_dotenv"), \
+             patch("hermes_cli.runtime_provider.resolve_runtime_provider",
+                   return_value={"provider": "openrouter", "api_key": "x",
+                                 "base_url": "https://example.invalid",
+                                 "api_mode": "chat_completions"}), \
+             patch("tools.mcp_tool.discover_mcp_tools", return_value=[]), \
              patch("run_agent.AIAgent") as mock_agent_cls:
             mock_agent = MagicMock()
             mock_agent.run_conversation.return_value = {"final_response": "ok"}
@@ -1482,6 +1492,11 @@ class TestRunJobConfigLogging:
         with patch("cron.scheduler._hermes_home", tmp_path), \
              patch("cron.scheduler._resolve_origin", return_value=None), \
              patch("dotenv.load_dotenv"), \
+             patch("hermes_cli.runtime_provider.resolve_runtime_provider",
+                   return_value={"provider": "openrouter", "api_key": "x",
+                                 "base_url": "https://example.invalid",
+                                 "api_mode": "chat_completions"}), \
+             patch("tools.mcp_tool.discover_mcp_tools", return_value=[]), \
              patch("run_agent.AIAgent") as mock_agent_cls:
             mock_agent = MagicMock()
             mock_agent.run_conversation.return_value = {"final_response": "ok"}
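
The mocking pattern in this hunk can be illustrated in isolation. The following is a minimal sketch, not code from the repository: `Runtime`, `run_job`, and `check_warning_path` are hypothetical stand-ins for the scheduler, its provider resolution, and its MCP discovery. It shows how stubbing the expensive collaborators lets a test assert on the warning log alone, without subprocesses or network access.

```python
import logging
from unittest.mock import patch

logger = logging.getLogger("scheduler_sketch")
logger.setLevel(logging.WARNING)


class Runtime:
    """Hypothetical stand-ins for the expensive steps after the warning."""

    @staticmethod
    def resolve_provider() -> dict:
        raise RuntimeError("would hit the network")

    @staticmethod
    def discover_tools() -> list:
        raise RuntimeError("would spawn subprocesses")


def run_job(config_ok: bool) -> str:
    """Log a warning on bad config, then continue into the heavy steps."""
    if not config_ok:
        logger.warning("config.yaml could not be parsed; using defaults")
    provider = Runtime.resolve_provider()
    tools = Runtime.discover_tools()
    return f"{provider['provider']}:{len(tools)}"


def check_warning_path() -> list:
    """Run the job with the heavy steps stubbed; return captured warnings."""
    records = []

    class _Capture(logging.Handler):
        def emit(self, record):
            records.append(record.getMessage())

    handler = _Capture(level=logging.WARNING)
    logger.addHandler(handler)
    try:
        # Stub both collaborators so only the warning path does real work.
        with patch.object(Runtime, "resolve_provider",
                          return_value={"provider": "openrouter"}), \
             patch.object(Runtime, "discover_tools", return_value=[]):
            result = run_job(config_ok=False)
    finally:
        logger.removeHandler(handler)
    assert result == "openrouter:0"
    return records
```

Because the stubs return immediately, the test's runtime no longer depends on CI load, which is the property the commit is after.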

--- a/tests/tools/test_browser_supervisor.py
+++ b/tests/tools/test_browser_supervisor.py
@@ -89,18 +89,45 @@ def chrome_cdp(request):
         except Exception:
             time.sleep(0.25)
     if ws_url is None:
-        proc.terminate()
-        proc.wait(timeout=5)
+        try:
+            proc.terminate()
+            proc.wait(timeout=5)
+        except (subprocess.TimeoutExpired, AssertionError, Exception):
+            try:
+                proc.kill()
+            except Exception:
+                pass
+            try:
+                proc.wait(timeout=2)
+            except (AssertionError, Exception):
+                pass
         shutil.rmtree(profile, ignore_errors=True)
         pytest.skip("Chrome didn't expose CDP in time")
     yield ws_url, port
-    proc.terminate()
+    # Tear down Chrome. The stdlib `subprocess._wait()` POSIX implementation
+    # has a known race (https://bugs.python.org/issue38630): when SIGCHLD
+    # arrives concurrently with `proc.wait()`, `_try_wait(WNOHANG)` can
+    # return a foreign pid and the `assert pid == self.pid or pid == 0`
+    # fires. We saw this in CI on slice 1 after this fixture's teardown
+    # (PR #33661 follow-up). Swallow the stdlib race + force-kill if wait
+    # hangs, then always reap so we don't leak a zombie.
+    try:
+        proc.terminate()
+    except Exception:
+        pass
     try:
         proc.wait(timeout=3)
-    except Exception:
-        proc.kill()
+    except (subprocess.TimeoutExpired, AssertionError, Exception):
+        try:
+            proc.kill()
+        except Exception:
+            pass
+        try:
+            proc.wait(timeout=2)
+        except (AssertionError, Exception):
+            pass
     shutil.rmtree(profile, ignore_errors=True)
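
The same terminate, wait, kill, reap sequence appears twice in this fixture (early-skip branch and teardown). As a sketch only, it could be factored into one helper. `stop_process` and its parameters are hypothetical names, not part of this commit, and the sketch catches narrower exception types than the fixture, which deliberately swallows any `Exception`.

```python
import subprocess
import sys


def stop_process(proc: subprocess.Popen, grace: float = 3.0,
                 kill_wait: float = 2.0) -> None:
    """Terminate proc, escalate to kill if needed, and always try to reap it."""
    try:
        proc.terminate()
    except OSError:
        pass  # the process is already gone

    try:
        proc.wait(timeout=grace)
    except (subprocess.TimeoutExpired, AssertionError):
        # TimeoutExpired: the child ignored the terminate request.
        # AssertionError: the stdlib wait race described in the fixture comment.
        try:
            proc.kill()
        except OSError:
            pass
        try:
            # Second wait reaps the child so no zombie is left behind.
            proc.wait(timeout=kill_wait)
        except (subprocess.TimeoutExpired, AssertionError):
            pass


def demo() -> int:
    """Start a long-sleeping child, stop it, and return its exit status."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    stop_process(child)
    return child.returncode
```

Calling `stop_process` on a process that has already exited is harmless, so the helper is safe to use from both the skip branch and the teardown path.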