[Bugfix][CI] Fix ImportError: libcudart.so.12: cannot open shared object file: No such file or directory#44192
NickLucche wants to merge 7 commits into
Signed-off-by: NickLucche <nlucches@redhat.com>
|
Tests are now failing with an mp-related issue; trying to swap the install order.
|
Yes, the longer-term fix is in NIXL packaging, so that we don't have to do this.
Signed-off-by: NickLucche <nlucches@redhat.com>
664f60c to e442c13
|
Perhaps this would help with the latest issue: #44252. However, it would probably be best to understand why the latest nixl is causing CUDA to be initialized earlier than it was before. Also noticed this in the CI logs, not sure if it's related or preexisting:
|
Ah, I see @alec-flowers just opened #44258 👍 ... it's still not obvious how this was exposed by these changes and wasn't an issue prior to them!
|
This pull request has merge conflicts that must be resolved before it can be |
|
Is this superseded by #44266 (merged)?
|
@Harry-Chen @MatthewBonanni #44266 patched it for the nixl tests for now. I would keep this PR open as a reference for what we want to lean toward, and close it once we manage to add it.
Fix https://buildkite.com/vllm/ci/builds/69183/canvas?jid=019e8236-5969-4911-8b8b-095d157a85cf&tab=output.

Nixl>=1.1.0 installs both cu12/13. nixl-cu13 1.2.0 ships a nixl_ep_cpp.so linked against libcudart.so.13, and a valid concern about it was raised in the past in #39923.

We're now forced once again to install only a single nixl-cu* package, to keep the nixl-cu12 nixl_ep_cpp import from taking over and looking for libcudart.so.12. I think the issue arises because we reinstall nixl manually before each test on CI.

Now trying to use a single docker image with INSTALL_KV_CONNECTORS on CI so that it matches what we ship in release.

PS: this is a fix to unblock CI. I think we should have a carefully thought-out, lasting solution that sorts out these import issues, especially now that nixl_ep is the default for eplb. cc @alec-flowers
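
For anyone debugging a similar failure locally, here is a minimal diagnostic sketch of the mismatch described above. It is not part of this PR. It assumes the wheels are named nixl-cu12 / nixl-cu13 as mentioned in the description, and it assumes that asking the dynamic loader for libcudart.so.<major> via ctypes roughly approximates what the nixl_ep_cpp import needs. All helper names (nixl_cuda_majors, cudart_loadable, find_conflicts, installed_dist_names) are hypothetical.

```python
import ctypes
import re
from importlib import metadata

# Matches distribution names such as "nixl-cu12" or "nixl_cu13".
_NIXL_CUDA_RE = re.compile(r"^nixl[-_]cu(\d+)$")


def nixl_cuda_majors(dist_names):
    """Return the sorted CUDA majors of any nixl-cu* names in dist_names."""
    majors = set()
    for name in dist_names:
        match = _NIXL_CUDA_RE.match(name.lower())
        if match:
            majors.add(int(match.group(1)))
    return sorted(majors)


def cudart_loadable(major):
    """Return True if the dynamic loader can resolve libcudart.so.<major>."""
    try:
        ctypes.CDLL(f"libcudart.so.{major}")
    except OSError:
        return False
    return True


def find_conflicts(dist_names, loadable=cudart_loadable):
    """Return the nixl-cu* majors whose libcudart cannot be loaded."""
    return [m for m in nixl_cuda_majors(dist_names) if not loadable(m)]


def installed_dist_names():
    """Collect the names of all distributions visible to this interpreter."""
    names = []
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:
            names.append(name)
    return names
```

Calling find_conflicts(installed_dist_names()) inside the CI image would list the CUDA majors for which a nixl-cu* wheel is installed but the matching libcudart cannot be found. A non-empty result with both cu12 and cu13 installed would be consistent with the ImportError in the title, though it does not by itself prove which wheel's nixl_ep_cpp gets imported.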