[Bugfix][CI] Avoid CUDA init during tests.utils import by alec-flowers · Pull Request #44258 · vllm-project/vllm

alec-flowers · 2026-06-01T22:08:59Z

Summary

Fixes a CUDA-before-fork failure exposed by the PR #44192 CI image changes.

tests.utils is imported during pytest collection by tests that only need helpers such as create_new_process_for_each_test. It had a top-level import of vllm.model_executor.kernels.linear for the TestFP8Layer helper. In an image where NIXL EP and FlashInfer comm are installed, that import can initialize CUDA in the parent pytest process. Forked CUDA tests then fail with:

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
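As a rough analogy for why initialization in the parent matters, the sketch below uses a plain module-level flag as a stand-in for PyTorch's CUDA state (the flag and both helper names are hypothetical, and no CUDA is involved). A child created with fork starts with a copy of the parent's memory, so anything the parent initialized during import is already "initialized" in the child. The sketch assumes a Unix-like system where os.fork is available.

```python
import os

# Hypothetical stand-in for "has CUDA been initialized in this process?".
_fake_cuda_initialized = False


def fake_lazy_init() -> None:
    """Pretend that some import initialized CUDA in the current process."""
    global _fake_cuda_initialized
    _fake_cuda_initialized = True


def child_inherits_flag() -> bool:
    """Fork a child and report whether it sees the flag as already set."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: it starts with a copy of the parent's memory, flag included.
        os.close(read_fd)
        os.write(write_fd, b"1" if _fake_cuda_initialized else b"0")
        os._exit(0)
    # Parent: read the child's one-byte answer and reap the child.
    os.close(write_fd)
    data = os.read(read_fd, 1)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return data == b"1"
```

Calling child_inherits_flag() before and after fake_lazy_init() shows the child's view flipping from uninitialized to initialized, which is the condition PyTorch refuses to continue from in a forked subprocess.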

This PR keeps those CUDA-touching imports lazy at their actual use sites and adds a canary asserting that importing tests.utils does not initialize CUDA. It also removes the CUDA_MODULE_LOADING=LAZY test-image workaround added in #44192: that environment variable does not prevent this failure because the import path explicitly calls torch.cuda.get_device_capability(), which initializes PyTorch CUDA anyway.
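The lazy-import-plus-canary idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not the contents of check_test_utils_no_cuda_init.py: colorsys stands in for the heavy CUDA-touching dependency, and helpers_eager, helpers_lazy, import_pulls_in, and run_canary are hypothetical names. The canary imports each module in a fresh interpreter so that state from the current process cannot mask the result; in the real setting the probe would instead be something like torch.cuda.is_initialized().

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# colorsys is a small stdlib module used as a stand-in for a heavy dependency.
EAGER_SRC = '''
import colorsys  # top-level: runs on every import of this module

def to_hsv(r, g, b):
    return colorsys.rgb_to_hsv(r, g, b)
'''

LAZY_SRC = '''
def to_hsv(r, g, b):
    import colorsys  # deferred until the helper is actually called
    return colorsys.rgb_to_hsv(r, g, b)
'''


def import_pulls_in(module_dir: str, module_name: str, dependency: str) -> bool:
    """Import module_name in a fresh interpreter; report if dependency loaded."""
    code = (
        "import sys\n"
        f"sys.path.insert(0, {module_dir!r})\n"
        f"import {module_name}\n"
        f"print({dependency!r} in sys.modules)\n"
    )
    out = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return out.stdout.strip() == "True"


def run_canary() -> dict:
    """Write both helper variants to a temp dir and probe each one."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "helpers_eager.py").write_text(EAGER_SRC)
        Path(tmp, "helpers_lazy.py").write_text(LAZY_SRC)
        return {
            "eager": import_pulls_in(tmp, "helpers_eager", "colorsys"),
            "lazy": import_pulls_in(tmp, "helpers_lazy", "colorsys"),
        }
```

Only the eager variant drags the dependency in at import time; the lazy variant pays that cost when to_hsv is first called, which is the property a canary of this kind is meant to pin down.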

Why This Surfaced Now

Short version: this is not because the logits processor test depends on NIXL. It surfaced because #44192 made the generic CI test image more release-like by installing KV connector dependencies into it.

Before #44192, the generic test image used by unrelated tests did not install requirements/kv_connectors.txt, so nixl_ep was not available there. That meant has_nixl_ep() returned false and the NIXL EP fused-MoE import branch was skipped during pytest collection. After #44192 set INSTALL_KV_CONNECTORS=true for the shared test image, nixl_ep became available to every test area using that image, including unrelated logits processor tests.

So nixl_ep is not necessarily a new addition to NIXL; for example, the upstream release image vllm/vllm-openai:v0.22.0-ubuntu2404 already has nixl, nixl-cu12, nixl-cu13, and nixl_ep. The change is that unrelated CI tests started running in an image with those connector packages installed.

The exact import chain I observed is:

tests.utils
  -> vllm.model_executor.kernels.linear
  -> vllm.model_executor.kernels.linear.mixed_precision.marlin
  -> vllm.model_executor.layers.quantization.utils.marlin_utils
  -> vllm.model_executor.layers.fused_moe
  -> vllm.model_executor.layers.fused_moe.all2all_utils
  -> .prepare_finalize.nixl_ep          # only when has_nixl_ep() is true
  -> vllm.distributed.device_communicators.all2all
  -> has_flashinfer_nvlink_two_sided()
  -> has_flashinfer_comm()
  -> importlib.util.find_spec("flashinfer.comm")
  -> imports parent flashinfer package
  -> flashinfer.jit.env.CompilationContext()
  -> torch.cuda.get_device_capability()
  -> torch.cuda._lazy_init()

That explains why this was not hit by unrelated tests before: the top-level import in tests.utils was already too broad, but the CUDA-initializing branch depended on optional connector packages that were not previously installed in the shared CI test image.
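One step in the chain is easy to overlook: importlib.util.find_spec on a dotted name imports the parent package in order to read its __path__, so an availability probe for a submodule can run the parent's __init__.py. The following self-contained sketch shows this with a throwaway package (demo_parent_pkg and probe_parent_import are hypothetical names created only for this demonstration).

```python
import importlib
import importlib.util
import sys
import tempfile
from pathlib import Path

PKG = "demo_parent_pkg"  # hypothetical package name, exists only in this demo


def probe_parent_import() -> tuple:
    """Return (parent imported after top-level lookup,
    parent imported after submodule lookup)."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg_dir = Path(tmp, PKG)
        pkg_dir.mkdir()
        # Anything at the top level of __init__.py runs when the parent is imported.
        (pkg_dir / "__init__.py").write_text("SIDE_EFFECT = 'parent ran'\n")
        (pkg_dir / "comm.py").write_text("")
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            # Looking up a top-level name only locates it; nothing is executed.
            importlib.util.find_spec(PKG)
            after_top_level = PKG in sys.modules
            # Looking up a submodule imports the parent package first.
            importlib.util.find_spec(f"{PKG}.comm")
            after_submodule = PKG in sys.modules
            return after_top_level, after_submodule
        finally:
            sys.path.remove(tmp)
            sys.modules.pop(PKG, None)
```

The first lookup leaves the package unimported while the second one imports it, matching the chain above where probing for flashinfer.comm ends up executing the parent flashinfer package.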

There were already CI steps that manually installed connector dependencies, but those were the disaggregated test steps themselves. They ran commands like uv pip install --system -r /vllm-workspace/requirements/kv_connectors.txt and then launched the NIXL integration scripts. Those scripts start vllm serve processes and run the NIXL accuracy tests; they do not import tests.utils.create_new_process_for_each_test during collection and then fork CUDA work from that same parent pytest process. Because Buildkite steps run in separate job containers, those manual installs also did not leak into unrelated logits processor test jobs.

I reproduced the same underlying behavior in the upstream release image vllm/vllm-openai:v0.22.0-ubuntu2404:

dist vllm: 0.22.0
dist nixl: 1.1.0
dist nixl-cu12: 1.1.0
dist nixl-cu13: 1.1.0
dist cupy-cuda13x: 14.0.1
dist flashinfer-python: 0.6.11.post2
before import False
from vllm.model_executor.kernels.linear import _KernelT, init_fp8_linear_kernel
# stack reaches flashinfer/compilation_context.py: torch.cuda.get_device_capability()
after import True

Control run in the same release image, simulating the old generic test image by forcing has_nixl_ep() to return false:

before import False
pretend nixl_ep is absent
after import False
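The description does not say how the control forced has_nixl_ep() to return false. One plausible way to set up such a control is unittest.mock, sketched below against a stand-in module; fake_utils and import_branches_taken are hypothetical, and only the probe name mirrors the one in the import chain.

```python
import types
from unittest import mock

# Hypothetical stand-in for a module that exposes an optional-dependency probe.
fake_utils = types.ModuleType("fake_utils")
fake_utils.has_nixl_ep = lambda: True  # pretend the package is installed


def import_branches_taken() -> list:
    """Mimic an import path that is gated on the optional-dependency probe."""
    branches = ["fused_moe"]
    if fake_utils.has_nixl_ep():
        branches.append("prepare_finalize.nixl_ep")
    return branches


# Control: force the probe to report the package as absent.
with mock.patch.object(fake_utils, "has_nixl_ep", return_value=False):
    branches_when_absent = import_branches_taken()

# Outside the patch, the original probe is restored.
branches_when_present = import_branches_taken()
```

With the probe patched to false the gated branch is skipped, which is the behavior the old generic test image had simply because the package was not installed.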

Validation

Validated against CI image:

public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:e442c13f36f707e8de75043506713b454448ba75

Commands run with the local checkout mounted over /vllm-workspace:

python3 /vllm-workspace/tests/cuda/scripts/check_test_utils_no_cuda_init.py
python3 -m pytest -q -s tests/cuda/test_platform_no_cuda_init.py
python3 -m pytest -vv -s 'v1/logits_processors/test_correctness.py::test_logitsprocs[logitsprocs_under_test0-50-cuda:0]'
python3 -m pytest -q -s --tb=short v1/logits_processors/test_correctness.py
python3 -m pytest -s -x v1/kv_connector/nixl_integration/test_nixl_imports.py

The focused checks were also re-run with CUDA_MODULE_LOADING unset. The CUDA no-init test file was additionally run from /vllm-workspace/tests with PYTHONPATH unset, to match Buildkite's test working directory:

unset CUDA_MODULE_LOADING
python3 /vllm-workspace/tests/cuda/scripts/check_test_utils_no_cuda_init.py
python3 -m pytest -q -s tests/cuda/test_platform_no_cuda_init.py
python3 -m pytest -vv -s 'v1/logits_processors/test_correctness.py::test_logitsprocs[logitsprocs_under_test0-50-cuda:0]'

Results:

check_test_utils_no_cuda_init.py: OK
tests/cuda/test_platform_no_cuda_init.py: 3 passed
reported logits processor node: 1 passed
v1/logits_processors/test_correctness.py: 28 passed
NIXL import canary: 1 passed
check_test_utils_no_cuda_init.py with CUDA_MODULE_LOADING unset: OK
tests/cuda/test_platform_no_cuda_init.py with CUDA_MODULE_LOADING unset: 3 passed
reported logits processor node with CUDA_MODULE_LOADING unset: 1 passed

NIXL image sanity:

torch.version.cuda: 13.0
nixl: 1.2.0
nixl-cu12: not installed
nixl-cu13: 1.2.0
nixl_ep_cpp links to libcudart.so.13

Notes

I did not find an existing PR that addresses this specific tests.utils import-time CUDA initialization, so this does not duplicate one. AI assistance was used to investigate, reproduce, and draft this change; the author reviewed the diff and validation output.

Signed-off-by: NickLucche <nlucches@redhat.com>

njhill · 2026-06-01T22:12:21Z

Thanks @alec-flowers

Signed-off-by: Alec Flowers <aflowers@nvidia.com>

mergify · 2026-06-03T06:15:08Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @alec-flowers.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

NickLucche added 7 commits June 1, 2026 12:07

init

c381854

Signed-off-by: NickLucche <nlucches@redhat.com>

do not reinstall nixl

87a9abe

Signed-off-by: NickLucche <nlucches@redhat.com>

install connectors for CI runs in the image

7f45755

Signed-off-by: NickLucche <nlucches@redhat.com>

wrong target image

6c1cd69

Signed-off-by: NickLucche <nlucches@redhat.com>

install connector in test image dockerfile

25ba47d

Signed-off-by: NickLucche <nlucches@redhat.com>

swap order

a08d1df

Signed-off-by: NickLucche <nlucches@redhat.com>

lazy cuda loading

e442c13

Signed-off-by: NickLucche <nlucches@redhat.com>

njhill mentioned this pull request Jun 1, 2026

[Bugfix][CI] Fix ImportError: libcudart.so.12: cannot open shared object file: No such file or directory #44192

Open

alec-flowers force-pushed the pr-44192-debug branch from 035ee04 to b60baee Compare June 1, 2026 22:29

njhill added the ready label Jun 1, 2026

alec-flowers marked this pull request as ready for review June 1, 2026 22:46

alec-flowers requested review from Harry-Chen and khluu as code owners June 1, 2026 22:46

alec-flowers force-pushed the pr-44192-debug branch from b60baee to db307b9 Compare June 1, 2026 22:51

mergify Bot added the ci/build, nvidia, and bug labels Jun 1, 2026

github-project-automation Bot added this to NVIDIA Jun 1, 2026

alec-flowers force-pushed the pr-44192-debug branch from db307b9 to 81af843 Compare June 1, 2026 23:13

Avoid CUDA init during tests.utils import

cc526ce

Signed-off-by: Alec Flowers <aflowers@nvidia.com>

alec-flowers force-pushed the pr-44192-debug branch from 81af843 to cc526ce Compare June 1, 2026 23:19

alec-flowers changed the title ~~[Bugfix][CI] Avoid CUDA init during tests.utils import~~ [Bugfix][CI] Fix NIXL connector install scope Jun 2, 2026

mergify Bot added the kv-connector label Jun 2, 2026

alec-flowers force-pushed the pr-44192-debug branch from 3729912 to cc526ce Compare June 2, 2026 00:28

alec-flowers changed the title ~~[Bugfix][CI] Fix NIXL connector install scope~~ [Bugfix][CI] Avoid CUDA init during tests.utils import Jun 2, 2026

khluu mentioned this pull request Jun 2, 2026

Relax CuPy constraint to only exclude 14.1.0 #44284

Open

mergify Bot added the needs-rebase label Jun 3, 2026

alec-flowers closed this Jun 3, 2026

github-project-automation Bot moved this to Done in NVIDIA Jun 3, 2026

alec-flowers wants to merge 8 commits into vllm-project:main from alec-flowers:pr-44192-debug
