[CI/Build][BugFix] fix cuda/compat loading order issue in docker build#33116
houseroad merged 2 commits into vllm-project:main
Signed-off-by: Pengchao Wang <wpc@fb.com>
Code Review
This pull request fixes a CUDA library loading order issue by renaming the ld.so.conf.d configuration file to lower its priority. The change is correct and well-justified. I've also identified a potential issue with how the CUDA_VERSION is parsed throughout the Dockerfile. The current method is fragile and can fail if the version number is an integer. I've provided suggestions to make the version parsing more robust. While I've commented on the lines changed in this PR, this issue is present in other parts of the file and should be addressed consistently.
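To illustrate the parsing concern raised in the review, here is a minimal sketch of version parsing that tolerates an integer-only value. It is not the Dockerfile's actual logic, which is not shown on this page; the helper name `cuda_major_minor` is hypothetical, and defaulting a missing minor version to 0 is an assumption.

```python
def cuda_major_minor(version: str) -> tuple[int, int]:
    """Return (major, minor) for strings like "13", "13.0" or "12.9.1".

    Hypothetical helper for illustration only. A missing minor
    component is assumed to mean 0.
    """
    parts = version.strip().split(".")
    if not parts[0].isdigit():
        raise ValueError(f"unrecognized CUDA version: {version!r}")
    major = int(parts[0])
    # Fall back to 0 when the minor component is missing.
    minor = int(parts[1]) if len(parts) > 1 and parts[1].isdigit() else 0
    return major, minor


print(cuda_major_minor("13"))      # integer-only input still parses
print(cuda_major_minor("12.9.1"))  # the patch component is ignored
```

Returning a tuple rather than a formatted string lets callers build whichever form they need (for example `13.0` or `13-0`) without re-parsing.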
Thanks a lot for the fix, @wpc! I've been very busy with the MLPerf deadline recently and didn't have much time to get back to this problem. Since this PR is merged, I'll give the latest main a try on our internal dev GB200/GB300 clusters once the deadline is over, and report back if I encounter any further issues.
Coming from vllm-project/vllm#33116 Signed-off-by: Huy Do <huydhn@gmail.com>
This PR did not really seem to fix things correctly...
Unfortunately, this fix does not work for game cards with no
I still get this error with CUDA 13.1 using the nightly or cu130-nightly Docker builds. I also have the compatibility packages installed. We have A5000 cards.
vllm-project#33116) Signed-off-by: Pengchao Wang <wpc@fb.com> Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
vllm-project#33116) Signed-off-by: Pengchao Wang <wpc@fb.com> Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com> Signed-off-by: PiratePai <416932041@qq.com> Signed-off-by: Pai <416932041@qq.com>
Purpose
In PR #30784 we persisted the CUDA compat lib path with ldconfig. This caused a CUDA init issue when the Docker image runs on a newer driver that does not require the compatibility libs.
The issue is that the CUDA compat path is registered under a `00-*` prefix, so the compat path now has higher priority than even the normal CUDA libs.
This patch removes the `00-` prefix, lowering the compat lib loading priority to fix the problem.
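The effect of the prefix can be sketched with a simplified model, assuming the files under `/etc/ld.so.conf.d/` are read in sorted name order, so an earlier file's directories take precedence. The file names below are illustrative assumptions and are not taken from this PR.

```python
def conf_read_order(conf_names: list[str]) -> list[str]:
    """Model the read order of ld.so.conf.d files as a plain sorted list.

    Simplification: real ordering comes from how the include glob is
    expanded; code-point sorting is assumed here.
    """
    return sorted(conf_names)


# Hypothetical file names, for illustration only.
before = conf_read_order(["00-cuda-compat.conf", "000_cuda.conf", "nvidia.conf"])
after = conf_read_order(["cuda-compat.conf", "000_cuda.conf", "nvidia.conf"])

print(before)  # the 00- prefixed compat file sorts first
print(after)   # without the prefix it sorts after 000_cuda.conf
```

In this model `-` sorts before `0`, which is why a `00-` prefix wins even against a file that starts with `000`.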
Test Plan
Build a new image with the fix and run the following; there should be no 830 error and the result should be "True 2".
Test Result
GB200, Driver Version: 580.105.08 (does not need compat libs):
H100, Driver Version: 535.183.06 (needs compat libs to run CUDA 13):