[Bugfix][CI] Avoid CUDA init during tests.utils import#44258
Closed
alec-flowers wants to merge 8 commits into
Closed
[Bugfix][CI] Avoid CUDA init during tests.utils import#44258alec-flowers wants to merge 8 commits into
alec-flowers wants to merge 8 commits into
Conversation
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Member
|
Thanks @alec-flowers |
035ee04 to
b60baee
Compare
b60baee to
db307b9
Compare
db307b9 to
81af843
Compare
Signed-off-by: Alec Flowers <aflowers@nvidia.com>
81af843 to
cc526ce
Compare
3729912 to
cc526ce
Compare
Contributor
|
This pull request has merge conflicts that must be resolved before it can be |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a CUDA-before-fork failure exposed by the PR #44192 CI image changes.
tests.utilsis imported during pytest collection by tests that only need helpers such ascreate_new_process_for_each_test. It had a top-level import ofvllm.model_executor.kernels.linearfor theTestFP8Layerhelper. In an image where NIXL EP and FlashInfer comm are installed, that import can initialize CUDA in the parent pytest process. Forked CUDA tests then fail with:This keeps those CUDA-touching imports lazy at their actual use sites and adds a canary to assert that importing
tests.utilsdoes not initialize CUDA. It also removes theCUDA_MODULE_LOADING=LAZYtest-image workaround added in #44192: that environment variable does not stop this failure because the import path explicitly callstorch.cuda.get_device_capability(), which initializes PyTorch CUDA anyway.Why This Surfaced Now
Short version: this is not because the logits processor test depends on NIXL. It surfaced because #44192 made the generic CI test image more release-like by installing KV connector dependencies into it.
Before #44192, the generic test image used by unrelated tests did not install
requirements/kv_connectors.txt, sonixl_epwas not available there. That meanthas_nixl_ep()returned false and the NIXL EP fused-MoE import branch was skipped during pytest collection. After #44192 setINSTALL_KV_CONNECTORS=truefor the shared test image,nixl_epbecame available to every test area using that image, including unrelated logits processor tests.So NIXL did not necessarily newly start shipping
nixl_ep; for example, the upstream release imagevllm/vllm-openai:v0.22.0-ubuntu2404already hasnixl,nixl-cu12,nixl-cu13, andnixl_ep. The change is that unrelated CI tests started running in an image with those connector packages installed.The exact import chain I observed is:
That explains why this was not hit by unrelated tests before: the top-level import in
tests.utilswas already too broad, but the CUDA-initializing branch depended on optional connector packages that were not previously installed in the shared CI test image.There were already CI steps that manually installed connector dependencies, but those were the disaggregated test steps themselves. They ran commands like
uv pip install --system -r /vllm-workspace/requirements/kv_connectors.txtand then launched the NIXL integration scripts. Those scripts startvllm serveprocesses and run the NIXL accuracy tests; they do not importtests.utils.create_new_process_for_each_testduring collection and then fork CUDA work from that same parent pytest process. Because Buildkite steps run in separate job containers, those manual installs also did not leak into unrelated logits processor test jobs.I reproduced the same underlying behavior in the upstream release image
vllm/vllm-openai:v0.22.0-ubuntu2404:Control in that same release image, simulating the old generic test image by forcing
has_nixl_ep()false:Validation
Validated against CI image:
Commands run with the local checkout mounted over
/vllm-workspace:python3 /vllm-workspace/tests/cuda/scripts/check_test_utils_no_cuda_init.py python3 -m pytest -q -s tests/cuda/test_platform_no_cuda_init.py python3 -m pytest -vv -s 'v1/logits_processors/test_correctness.py::test_logitsprocs[logitsprocs_under_test0-50-cuda:0]' python3 -m pytest -q -s --tb=short v1/logits_processors/test_correctness.py python3 -m pytest -s -x v1/kv_connector/nixl_integration/test_nixl_imports.pyFocused checks also re-run with
CUDA_MODULE_LOADINGunset. The CUDA no-init file was also run from/vllm-workspace/testswithPYTHONPATHunset to match Buildkite's test cwd:Results:
NIXL image sanity:
Notes
This is not duplicating an existing PR that I found for this specific
tests.utilsimport-time CUDA initialization. AI assistance was used to investigate, reproduce, and draft this change; the author reviewed the diff and validation output.