
[Bugfix][CI] Avoid CUDA init during tests.utils import #44258

Closed
alec-flowers wants to merge 8 commits into vllm-project:main from alec-flowers:pr-44192-debug


@alec-flowers alec-flowers commented Jun 1, 2026

Summary

Fixes a CUDA-before-fork failure exposed by the PR #44192 CI image changes.

tests.utils is imported during pytest collection by tests that only need helpers such as create_new_process_for_each_test. It had a top-level import of vllm.model_executor.kernels.linear for the TestFP8Layer helper. In an image where NIXL EP and FlashInfer comm are installed, that import can initialize CUDA in the parent pytest process. Forked CUDA tests then fail with:

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

This keeps those CUDA-touching imports lazy at their actual use sites and adds a canary to assert that importing tests.utils does not initialize CUDA. It also removes the CUDA_MODULE_LOADING=LAZY test-image workaround added in #44192: that environment variable does not stop this failure because the import path explicitly calls torch.cuda.get_device_capability(), which initializes PyTorch CUDA anyway.
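
As a rough illustration of the "lazy at the use site" pattern described above, here is a minimal sketch. It is not the diff from this PR: `HEAVY_MODULE`, `passthrough_helper`, and `Fp8LayerStub` are hypothetical names, and the stdlib module `colorsys` stands in for a heavy import such as `vllm.model_executor.kernels.linear` so that the sketch runs anywhere.

```python
import importlib
import sys

# Hypothetical stand-in for a heavy, CUDA-touching module such as
# vllm.model_executor.kernels.linear. A stdlib module keeps the sketch runnable.
HEAVY_MODULE = "colorsys"


def passthrough_helper(fn):
    """Cheap helper (hypothetical). Importing this file to reach it
    does not import HEAVY_MODULE, because nothing at module top level does."""
    return fn


class Fp8LayerStub:
    """Hypothetical helper that genuinely needs the heavy module."""

    def __init__(self):
        # Deferred import: the cost (and any side effect such as CUDA
        # initialization) is paid only when the helper is constructed.
        self._kernels = importlib.import_module(HEAVY_MODULE)

    def kernel_module_name(self):
        return self._kernels.__name__


if __name__ == "__main__":
    layer = Fp8LayerStub()
    print(layer.kernel_module_name(), HEAVY_MODULE in sys.modules)
```

In real code the deferred import would more likely be a plain `from ... import ...` statement inside the function or method body. The trade-off is that import errors surface at first use rather than at collection time.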

Why This Surfaced Now

Short version: this is not because the logits processor test depends on NIXL. It surfaced because #44192 made the generic CI test image more release-like by installing KV connector dependencies into it.

Before #44192, the generic test image used by unrelated tests did not install requirements/kv_connectors.txt, so nixl_ep was not available there. That meant has_nixl_ep() returned false and the NIXL EP fused-MoE import branch was skipped during pytest collection. After #44192 set INSTALL_KV_CONNECTORS=true for the shared test image, nixl_ep became available to every test area using that image, including unrelated logits processor tests.

So this is not a case of NIXL newly shipping nixl_ep; for example, the upstream release image vllm/vllm-openai:v0.22.0-ubuntu2404 already has nixl, nixl-cu12, nixl-cu13, and nixl_ep. What changed is that unrelated CI tests started running in an image with those connector packages installed.

The exact import chain I observed is:

tests.utils
  -> vllm.model_executor.kernels.linear
  -> vllm.model_executor.kernels.linear.mixed_precision.marlin
  -> vllm.model_executor.layers.quantization.utils.marlin_utils
  -> vllm.model_executor.layers.fused_moe
  -> vllm.model_executor.layers.fused_moe.all2all_utils
  -> .prepare_finalize.nixl_ep          # only when has_nixl_ep() is true
  -> vllm.distributed.device_communicators.all2all
  -> has_flashinfer_nvlink_two_sided()
  -> has_flashinfer_comm()
  -> importlib.util.find_spec("flashinfer.comm")
  -> imports parent flashinfer package
  -> flashinfer.jit.env.CompilationContext()
  -> torch.cuda.get_device_capability()
  -> torch.cuda._lazy_init()

That explains why this was not hit by unrelated tests before: the top-level import in tests.utils was already too broad, but the CUDA-initializing branch depended on optional connector packages that were not previously installed in the shared CI test image.
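
The `importlib.util.find_spec("flashinfer.comm")` step in the chain above is easy to misread as a side-effect-free lookup. For a dotted name, `find_spec` imports the parent package first, so any top-level code in the parent runs. The sketch below shows this with a stdlib package; the function name `probe_submodule` is hypothetical, and `wsgiref.headers` is only a stand-in for `flashinfer.comm`.

```python
import importlib.util
import sys


def probe_submodule(dotted_name):
    """Look up a submodule spec and report what the lookup itself loaded.

    Note: if the parent package is missing, find_spec raises
    ModuleNotFoundError instead of returning None.
    """
    parent = dotted_name.rpartition(".")[0]
    parent_loaded_before = parent in sys.modules
    # For "pkg.sub", this call imports "pkg" (running pkg/__init__.py),
    # but it does not execute "pkg.sub" itself.
    spec = importlib.util.find_spec(dotted_name)
    return {
        "found": spec is not None,
        "parent_loaded_before": parent_loaded_before,
        "parent_loaded_after": parent in sys.modules,
    }


if __name__ == "__main__":
    print(probe_submodule("wsgiref.headers"))
```

An availability check written this way therefore inherits whatever the parent package does at import time, which is how a "has this optional package?" helper can end up initializing CUDA.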

There were already CI steps that manually installed connector dependencies, but those were the disaggregated test steps themselves. They ran commands like uv pip install --system -r /vllm-workspace/requirements/kv_connectors.txt and then launched the NIXL integration scripts. Those scripts start vllm serve processes and run the NIXL accuracy tests; they do not import tests.utils.create_new_process_for_each_test during collection and then fork CUDA work from that same parent pytest process. Because Buildkite steps run in separate job containers, those manual installs also did not leak into unrelated logits processor test jobs.

I reproduced the same underlying behavior in the upstream release image vllm/vllm-openai:v0.22.0-ubuntu2404:

dist vllm: 0.22.0
dist nixl: 1.1.0
dist nixl-cu12: 1.1.0
dist nixl-cu13: 1.1.0
dist cupy-cuda13x: 14.0.1
dist flashinfer-python: 0.6.11.post2
before import False
from vllm.model_executor.kernels.linear import _KernelT, init_fp8_linear_kernel
# stack reaches flashinfer/compilation_context.py: torch.cuda.get_device_capability()
after import True

Control run in the same release image, simulating the old generic test image by forcing has_nixl_ep() to return false:

before import False
pretend nixl_ep is absent
after import False
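
The before/after probe used in the two runs above can be sketched as a small helper that imports a module in a fresh interpreter and reports whether a boolean probe flipped from False to True. This is not the `check_test_utils_no_cuda_init.py` script from this PR; `import_flips_probe` and `TORCH_CUDA_PROBE` are hypothetical names. The default probe assumes PyTorch is installed and uses `torch.cuda.is_initialized()`; the probe is passed as a string so the helper can also be exercised without PyTorch.

```python
import subprocess
import sys

# Default probe expression, evaluated inside the child interpreter.
# Assumes PyTorch is installed in the environment being checked.
TORCH_CUDA_PROBE = "__import__('torch').cuda.is_initialized()"


def import_flips_probe(module_name, probe=TORCH_CUDA_PROBE):
    """Return True if importing module_name flips probe from False to True.

    The import runs in a fresh interpreter so that state already present
    in the calling process cannot mask or fake the result.
    """
    child = "\n".join([
        "import importlib, sys",
        f"before = bool({probe})",
        f"importlib.import_module({module_name!r})",
        f"after = bool({probe})",
        "print('FLIPPED' if after and not before else 'CLEAN')",
    ])
    # check=True raises CalledProcessError if the child import itself fails.
    result = subprocess.run(
        [sys.executable, "-c", child],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip().splitlines()[-1] == "FLIPPED"


if __name__ == "__main__":
    # Stdlib-only demonstration: the probe watches sys.modules instead of CUDA.
    print(import_flips_probe("colorsys", "'colorsys' in sys.modules"))
```

In a vLLM checkout with PyTorch available, a call such as `import_flips_probe("tests.utils")` would be expected to return False once the CUDA-touching imports are lazy, though that depends on the installed optional packages.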

Validation

Validated against CI image:

public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:e442c13f36f707e8de75043506713b454448ba75

Commands run with the local checkout mounted over /vllm-workspace:

python3 /vllm-workspace/tests/cuda/scripts/check_test_utils_no_cuda_init.py
python3 -m pytest -q -s tests/cuda/test_platform_no_cuda_init.py
python3 -m pytest -vv -s 'v1/logits_processors/test_correctness.py::test_logitsprocs[logitsprocs_under_test0-50-cuda:0]'
python3 -m pytest -q -s --tb=short v1/logits_processors/test_correctness.py
python3 -m pytest -s -x v1/kv_connector/nixl_integration/test_nixl_imports.py

Focused checks were also re-run with CUDA_MODULE_LOADING unset. The CUDA no-init test file was also run from /vllm-workspace/tests with PYTHONPATH unset to match Buildkite's test cwd:

unset CUDA_MODULE_LOADING
python3 /vllm-workspace/tests/cuda/scripts/check_test_utils_no_cuda_init.py
python3 -m pytest -q -s tests/cuda/test_platform_no_cuda_init.py
python3 -m pytest -vv -s 'v1/logits_processors/test_correctness.py::test_logitsprocs[logitsprocs_under_test0-50-cuda:0]'

Results:

check_test_utils_no_cuda_init.py: OK
tests/cuda/test_platform_no_cuda_init.py: 3 passed
reported logits processor node: 1 passed
v1/logits_processors/test_correctness.py: 28 passed
NIXL import canary: 1 passed
check_test_utils_no_cuda_init.py with CUDA_MODULE_LOADING unset: OK
tests/cuda/test_platform_no_cuda_init.py with CUDA_MODULE_LOADING unset: 3 passed
reported logits processor node with CUDA_MODULE_LOADING unset: 1 passed

NIXL image sanity:

torch.version.cuda: 13.0
nixl: 1.2.0
nixl-cu12: not installed
nixl-cu13: 1.2.0
nixl_ep_cpp links to libcudart.so.13
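
A listing like the one above can be gathered with `importlib.metadata`. The sketch below is an assumption about how such a check might be written, not the command actually used for this PR; `dist_report` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def dist_report(names):
    """Map each distribution name to its installed version,
    or to the string 'not installed' when it is absent."""
    report = {}
    for name in names:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for name, ver in dist_report(["nixl", "nixl-cu12", "nixl-cu13"]).items():
        print(f"{name}: {ver}")
```

Querying distribution metadata does not import the packages themselves, so this kind of check should not trigger the import-time side effects discussed earlier.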

Notes

I did not find an existing PR that addresses this specific tests.utils import-time CUDA initialization, so this should not be a duplicate. AI assistance was used to investigate, reproduce, and draft this change; the author reviewed the diff and validation output.

Signed-off-by: NickLucche <nlucches@redhat.com>

njhill commented Jun 1, 2026

Thanks @alec-flowers

@njhill njhill added the ready label (ONLY add when PR is ready to merge/full CI is needed) Jun 1, 2026
@alec-flowers alec-flowers marked this pull request as ready for review June 1, 2026 22:46
@mergify mergify Bot added the ci/build, nvidia, and bug (Something isn't working) labels Jun 1, 2026
Signed-off-by: Alec Flowers <aflowers@nvidia.com>
@alec-flowers alec-flowers changed the title [Bugfix][CI] Avoid CUDA init during tests.utils import [Bugfix][CI] Fix NIXL connector install scope Jun 2, 2026
@mergify mergify Bot added the kv-connector label Jun 2, 2026
@alec-flowers alec-flowers changed the title [Bugfix][CI] Fix NIXL connector install scope [Bugfix][CI] Avoid CUDA init during tests.utils import Jun 2, 2026

mergify Bot commented Jun 3, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @alec-flowers.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@github-project-automation github-project-automation Bot moved this to Done in NVIDIA Jun 3, 2026

Labels

bug (Something isn't working), ci/build, kv-connector, needs-rebase, nvidia, ready (ONLY add when PR is ready to merge/full CI is needed)

Projects

Status: Done


3 participants