[Bugfix][CI] Fix ImportError: libcudart.so.12: cannot open shared object file: No such file or directory#44192

Open
NickLucche wants to merge 7 commits into
vllm-project:mainfrom
NickLucche:fix-libcudart-nixlep

Conversation

@NickLucche
Member

@NickLucche NickLucche commented Jun 1, 2026

Fixes https://buildkite.com/vllm/ci/builds/69183/canvas?jid=019e8236-5969-4911-8b8b-095d157a85cf&tab=output.

nixl>=1.1.0 installs both the cu12 and cu13 packages.
nixl-cu13 1.2.0 ships a nixl_ep_cpp.so linked against libcudart.so.13; a valid concern about this was raised earlier in #39923.

We are once again forced to install only a single nixl-cu* package, so that the nixl-cu12 nixl_ep_cpp import does not take over and look for libcudart.so.12.
I think the issue arises because we reinstall nixl manually before each test on CI.
I'm now trying to use a single Docker image with INSTALL_KV_CONNECTORS on CI so that it matches what we ship in releases.
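
The failure in the title comes down to the dynamic loader not finding a given libcudart soname. As a rough diagnostic sketch (not part of this PR), one could probe from Python which CUDA runtime majors are loadable, using `ctypes`. The helper names `can_load` and `available_cudart` are made up for illustration. The probe only reflects the loader's default search path plus libraries already loaded into the process, so a runtime bundled inside a wheel directory may be reported as missing.

```python
import ctypes


def can_load(soname: str) -> bool:
    """Return True if the dynamic loader can open `soname`, else False."""
    try:
        ctypes.CDLL(soname)
    except OSError:
        # Raised when the shared object cannot be found or opened.
        return False
    return True


def available_cudart(majors=(12, 13)) -> dict:
    """Map each CUDA major version to whether libcudart.so.<major> loads."""
    return {major: can_load(f"libcudart.so.{major}") for major in majors}


if __name__ == "__main__":
    for major, ok in available_cudart().items():
        print(f"libcudart.so.{major}: {'found' if ok else 'missing'}")
```

If only one major is reported as found, importing an extension built against the other major would be expected to fail with an ImportError like the one in the title.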

PS: this is a fix to unblock CI. I think we should have a carefully thought-out, lasting solution that sorts out these import issues, especially now that nixl_ep is the default for EPLB. cc @alec-flowers

Signed-off-by: NickLucche <nlucches@redhat.com>
@NickLucche NickLucche added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 1, 2026
@mergify mergify Bot added ci/build nvidia bug Something isn't working labels Jun 1, 2026
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
@NickLucche
Member Author

Tests are now failing with a multiprocessing-related issue:

FAILED v1/logits_processors/test_correctness.py::test_logitsprocs[logitsprocs_under_test0-50-cuda:0] - RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

Trying to swap the install order.
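
For context on the error above: once CUDA has been initialized in a parent process, a forked child cannot re-initialize it, so the child has to be created with the `spawn` start method instead. A minimal sketch of that choice, using only the standard library, is below. `pick_start_method` and `make_context` are hypothetical helpers, not vLLM code, and the `cuda_initialized` flag is assumed to be supplied by the caller (for example from `torch.cuda.is_initialized()`).

```python
import multiprocessing as mp


def pick_start_method(cuda_initialized: bool, default: str = "fork") -> str:
    """Choose a multiprocessing start method that is safe for CUDA."""
    if cuda_initialized:
        # A forked child would inherit an unusable CUDA context.
        return "spawn"
    # Fall back to "spawn" on platforms where the default is unavailable.
    return default if default in mp.get_all_start_methods() else "spawn"


def make_context(cuda_initialized: bool):
    """Return a multiprocessing context using the chosen start method."""
    return mp.get_context(pick_start_method(cuda_initialized))


if __name__ == "__main__":
    ctx = make_context(cuda_initialized=True)
    print(ctx.get_start_method())
```

With `spawn`, the target function has to be importable at module top level, and process creation should sit behind the `if __name__ == "__main__":` guard. This only works around the symptom; it does not explain why CUDA is being initialized earlier than before.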

@alec-flowers
Contributor

Yes, the longer-term fix is in NIXL packaging, so that we don't have to do this:
ai-dynamo/nixl#1646 cc @ovidiusm

Signed-off-by: NickLucche <nlucches@redhat.com>
@NickLucche NickLucche force-pushed the fix-libcudart-nixlep branch from 664f60c to e442c13 Compare June 1, 2026 16:37
@njhill
Member

njhill commented Jun 1, 2026

Perhaps this would help with the latest issue: #44252.

However, it would probably be best to understand why the latest nixl causes CUDA to be initialized earlier than before.

Also noticed this in the CI logs, not sure if it's related or preexisting:

  --------------------------------------------------------------------------------
    CuPy may not function correctly because multiple CuPy packages are installed
    in your environment:
      cupy-cuda12x, cupy-cuda13x
    Follow these steps to resolve this issue:
      1. For all packages listed above, run the following command to remove all
         existing CuPy installations:
           $ pip uninstall <package_name>
        If you previously installed CuPy via conda, also run the following:
           $ conda uninstall cupy
      2. Install the appropriate CuPy package.
         Refer to the Installation Guide for detailed instructions.
           https://docs.cupy.dev/en/stable/install.html
  --------------------------------------------------------------------------------
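
The CuPy warning and the nixl-cu12/nixl-cu13 clash have the same shape: more than one CUDA-version variant of a package is installed in one environment. A small sketch of how a CI step might detect this with `importlib.metadata` follows. The function names are hypothetical, and the prefixes `cupy-cuda` and `nixl-cu` are taken from the package names mentioned in this thread.

```python
import re
from importlib import metadata


def normalize(name: str) -> str:
    """Normalize a distribution name (lowercase, runs of -_. become '-')."""
    return re.sub(r"[-_.]+", "-", name).lower()


def installed_names() -> list:
    """Collect normalized names of all distributions in this environment."""
    names = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name") if dist.metadata else None
        if name:
            names.add(normalize(name))
    return sorted(names)


def conflicting_variants(names, prefix: str) -> list:
    """Return variants sharing `prefix` if more than one is present."""
    variants = sorted({normalize(n) for n in names
                       if normalize(n).startswith(prefix)})
    return variants if len(variants) > 1 else []


if __name__ == "__main__":
    names = installed_names()
    for prefix in ("cupy-cuda", "nixl-cu"):
        clash = conflicting_variants(names, prefix)
        if clash:
            print(f"multiple variants installed: {', '.join(clash)}")
```

Running a check like this in both the test image and the release image would make it visible when the two diverge in which CUDA variants they carry.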

@njhill
Member

njhill commented Jun 1, 2026

Ah I see @alec-flowers just opened #44258 👍 ... still not obvious how this was exposed by these changes and wasn't an issue prior to them!

@mergify
Contributor

mergify Bot commented Jun 2, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NickLucche.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 2, 2026
@MatthewBonanni
Member

MatthewBonanni commented Jun 2, 2026

Is this superseded by #44266 (merged)?

@Harry-Chen
Member

> Is this superseded by #44266 (merged)?

It looks like #44266 fixed the CI image for tests, but we may still have wrong dependencies in the published release images (not confirmed).

@NickLucche
Member Author

@Harry-Chen @MatthewBonanni #44266 patched it for the nixl tests for now.
I think in the long run we want to go this way and minimize the differences between the test and release images.
The differences are big enough right now that adding connector deps causes all sorts of issues on the test image, so that's not very reassuring.

I would keep this PR open as a reference for what we want to move toward, and close it once we manage to add it.

Labels

bug Something isn't working ci/build needs-rebase nvidia ready ONLY add when PR is ready to merge/full CI is needed


5 participants