test(ci): harden two flaky tests against CI noise (#33675)

Two unrelated transient failures hit PR #33661's initial CI run; both are
pre-existing on main and passed on rerun. Hardening:

1. tests/cron/test_scheduler.py::TestRunJobConfigLogging: added mocks for
   resolve_runtime_provider() and discover_mcp_tools(). The yaml-warning
   tests are meant to exercise only the warning-log path, but
   _run_job_impl continues into provider resolution and MCP discovery
   after the warning. Both can spawn subprocesses or hit the network,
   which pushed the test over its 30s budget under GHA load.

2. tests/tools/test_browser_supervisor.py: hardened Chrome teardown
   against the stdlib subprocess._wait() race (bpo-38630). When SIGCHLD
   arrives during proc.wait(), _try_wait(WNOHANG) can return a foreign
   pid and the 'assert pid == self.pid or pid == 0' fires. The fixture
   now catches AssertionError/TimeoutExpired, force-kills, and always
   reaps so no zombie escapes. The same hardening is applied to the
   early-skip branch.
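
A minimal sketch of the mocking approach from item 1, assuming a simplified
stand-in for the scheduler. JobRunner, ListHandler, run_with_stubs and the
warning text are hypothetical names invented for illustration; only
resolve_runtime_provider() and discover_mcp_tools() come from the commit
message, and their bodies here are placeholders. The idea is that the test
stubs both slow steps so only the warning-log path does real work:

```python
import logging
from unittest import mock

log = logging.getLogger("cron_sketch")


class ListHandler(logging.Handler):
    """Collects formatted warning messages so a test can inspect them."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


class JobRunner:
    """Hypothetical stand-in for the module that owns _run_job_impl."""

    @staticmethod
    def resolve_runtime_provider():
        raise RuntimeError("real call: may hit the network")

    @staticmethod
    def discover_mcp_tools():
        raise RuntimeError("real call: may spawn subprocesses")

    @classmethod
    def run_job(cls, raw_config):
        # Warning-log path: the only behaviour the test wants to check.
        if not isinstance(raw_config, dict):
            log.warning("config is not a mapping: %r", raw_config)
        # Execution continues into the slow steps after the warning.
        return cls.resolve_runtime_provider(), cls.discover_mcp_tools()


def run_with_stubs(raw_config):
    """Run the job with both slow steps patched out."""
    handler = ListHandler()
    log.addHandler(handler)
    try:
        with mock.patch.object(
            JobRunner, "resolve_runtime_provider", return_value="stub-provider"
        ) as provider_stub:
            with mock.patch.object(
                JobRunner, "discover_mcp_tools", return_value=[]
            ) as tools_stub:
                result = JobRunner.run_job(raw_config)
                calls = (provider_stub.call_count, tools_stub.call_count)
    finally:
        log.removeHandler(handler)
    return result, handler.messages, calls
```

Without the two patches, run_job() in this sketch would raise instead of
returning, which mirrors how the real test drifted into work it never meant
to exercise.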
Teknium 2026-05-27 23:15:41 -07:00 committed by GitHub
parent 875d930ac7
commit 4e702fe2d9
GPG key ID: B5690EEEBB952194
2 changed files with 47 additions and 5 deletions


@@ -89,18 +89,45 @@ def chrome_cdp(request):
         except Exception:
             time.sleep(0.25)
     if ws_url is None:
-        proc.terminate()
-        proc.wait(timeout=5)
+        try:
+            proc.terminate()
+            proc.wait(timeout=5)
+        except (subprocess.TimeoutExpired, AssertionError, Exception):
+            try:
+                proc.kill()
+            except Exception:
+                pass
+        try:
+            proc.wait(timeout=2)
+        except (AssertionError, Exception):
+            pass
         shutil.rmtree(profile, ignore_errors=True)
         pytest.skip("Chrome didn't expose CDP in time")
     yield ws_url, port
-    proc.terminate()
+    # Tear down Chrome. The stdlib `subprocess._wait()` POSIX implementation
+    # has a known race (https://bugs.python.org/issue38630): when SIGCHLD
+    # arrives concurrently with `proc.wait()`, `_try_wait(WNOHANG)` can
+    # return a foreign pid and the `assert pid == self.pid or pid == 0`
+    # fires. We saw this in CI on slice 1 after this fixture's teardown
+    # (PR #33661 follow-up). Swallow the stdlib race + force-kill if wait
+    # hangs, then always reap so we don't leak a zombie.
+    try:
+        proc.terminate()
+    except Exception:
+        pass
     try:
         proc.wait(timeout=3)
-    except Exception:
-        proc.kill()
+    except (subprocess.TimeoutExpired, AssertionError, Exception):
+        try:
+            proc.kill()
+        except Exception:
+            pass
+    try:
+        proc.wait(timeout=2)
+    except (AssertionError, Exception):
+        pass
     shutil.rmtree(profile, ignore_errors=True)
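
The teardown sequence in this hunk (terminate, bounded wait, force-kill on
failure, final reap) can be sketched as a standalone helper. This is an
illustration under stated assumptions, not code from the commit:
stop_process, demo, and the grace/kill_wait parameters are hypothetical
names, and the sketch catches a narrower set of exceptions (OSError,
TimeoutExpired, AssertionError) than the fixture's broad `Exception`.

```python
import subprocess
import sys


def stop_process(proc, grace=3.0, kill_wait=2.0):
    """Terminate proc, escalate to kill if needed, then try to reap it.

    Returns the exit code, or None if the process could not be reaped.
    """
    try:
        proc.terminate()
    except OSError:
        pass  # the process is already gone
    try:
        return proc.wait(timeout=grace)
    except (subprocess.TimeoutExpired, AssertionError):
        # Either the child ignored the signal or the stdlib wait raced.
        try:
            proc.kill()
        except OSError:
            pass
    # Final bounded reap so the child does not linger as a zombie.
    try:
        return proc.wait(timeout=kill_wait)
    except (subprocess.TimeoutExpired, AssertionError):
        return None


def demo():
    """Start a long-sleeping child, stop it, and report its exit status."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    stop_process(child)
    return child.poll()
```

Bounding both waits keeps the fixture from hanging a CI slice, and treating
AssertionError like a timeout means the stdlib race is handled by the same
escalation path rather than failing the test.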