[Docker][Hotfix] CUDA compatibility enablement #32474
emricksini-h wants to merge 5 commits into vllm-project:main
Conversation
Code Review
This pull request is a hotfix to address a regression related to CUDA compatibility libraries in the Docker build. The changes revert a persistent modification to the system's dynamic linker configuration, opting for a safer strategy of appending the compatibility library path to the `LD_LIBRARY_PATH` environment variable. This ensures the libraries are available as a fallback without causing issues on systems where they are not needed. My review identifies a potential issue in the Dockerfile where an environment variable is not quoted, which could lead to build failures under certain conditions. I've provided suggestions to improve the robustness of the script.
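For illustration of the quoting point, a defensively written append might look like the sketch below; the compat path is a placeholder, not the PR's exact Dockerfile content.

```bash
# Sketch only: append a CUDA compat dir to LD_LIBRARY_PATH defensively.
# The ${VAR:+...} form avoids emitting a leading ":" (which the loader would
# treat as the current directory) when the variable is unset or empty, and
# the outer quotes guard against word splitting.
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}/usr/local/cuda/compat"
```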
cc @wangshangsam & @huydhn.
…reset on `apt-get` (vllm-project#30784)" This reverts commit 2a60ac9. Signed-off-by: emricksini-h <emrick.birivoutin@hcompany.ai>
Signed-off-by: emricksini-h <emrick.birivoutin@hcompany.ai>
force-pushed from 820b0be to c4d639c
Signed-off-by: emricksini-h <emrick.birivoutin@hcompany.ai>
@emricksini-h did you test this change, and did it work for you? I'll test this branch on CUDA 13. If this works, let's merge this PR. Could you also add …
Yes, I've tested it on my side and it's working as expected.
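(For context, a smoke test along these lines — the image tag is hypothetical — matches the check used later in this thread:)

```bash
# Hypothetical smoke test; substitute the actual image tag from this PR.
docker run --rm --gpus all vllm-test-image \
  python3 -c 'import torch; print(torch.cuda.is_available())'
```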
Hmmm ... weirdly enough ... the current image works for me on both.

GH200:

```
root@0234b6d6eac5:/vllm-workspace# ldconfig -p | grep cuda
libnvrtc.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libnvrtc.so.13
libnvrtc-builtins.so.13.0 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libnvrtc-builtins.so.13.0
libnvidia-ptxjitcompiler.so.1 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libnvidia-ptxjitcompiler.so.1
libnvidia-nvvm70.so.4 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libnvidia-nvvm70.so.4
libnvidia-nvvm.so.4 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libnvidia-nvvm.so.4
libnvidia-nvvm.so (libc6,AArch64) => /usr/local/cuda-13.0/compat/libnvidia-nvvm.so
libnvidia-gpucomp.so.580.82.07 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libnvidia-gpucomp.so.580.82.07
libnvblas.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libnvblas.so.13
libicudata.so.70 (libc6,AArch64) => /usr/lib/aarch64-linux-gnu/libicudata.so.70
libcusparse.so.12 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcusparse.so.12
libcusolverMg.so.12 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcusolverMg.so.12
libcusolver.so.12 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcusolver.so.12
libcurand.so.10 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcurand.so.10
libcurand.so (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcurand.so
libcudart.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcudart.so.13
libcudart.so (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcudart.so
libcudadebugger.so.1 (libc6,AArch64) => /usr/lib/aarch64-linux-gnu/libcudadebugger.so.1
libcudadebugger.so.1 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libcudadebugger.so.1
libcuda.so.1 (libc6,AArch64) => /usr/lib/aarch64-linux-gnu/libcuda.so.1
libcuda.so.1 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libcuda.so.1
libcuda.so (libc6,AArch64) => /usr/local/cuda-13.0/compat/libcuda.so
libcublasLt.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcublasLt.so.13
libcublas.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcublas.so.13
root@0234b6d6eac5:/vllm-workspace# python3 -c 'import torch; torch.cuda.is_available()'
```

GB200:

```
root@nvl72122-T18:/vllm-workspace# ldconfig -p | grep cuda
libnvrtc.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libnvrtc.so.13
libnvrtc-builtins.so.13.0 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libnvrtc-builtins.so.13.0
libnvidia-ptxjitcompiler.so.1 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libnvidia-ptxjitcompiler.so.1
libnvidia-nvvm70.so.4 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libnvidia-nvvm70.so.4
libnvidia-nvvm.so.4 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libnvidia-nvvm.so.4
libnvidia-nvvm.so (libc6,AArch64) => /usr/local/cuda-13.0/compat/libnvidia-nvvm.so
libnvidia-gpucomp.so.580.82.07 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libnvidia-gpucomp.so.580.82.07
libnvblas.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libnvblas.so.13
libicudata.so.70 (libc6,AArch64) => /usr/lib/aarch64-linux-gnu/libicudata.so.70
libcusparse.so.12 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcusparse.so.12
libcusolverMg.so.12 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcusolverMg.so.12
libcusolver.so.12 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcusolver.so.12
libcurand.so.10 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcurand.so.10
libcurand.so (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcurand.so
libcudart.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcudart.so.13
libcudart.so (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcudart.so
libcudadebugger.so.1 (libc6,AArch64) => /usr/lib/aarch64-linux-gnu/libcudadebugger.so.1
libcudadebugger.so.1 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libcudadebugger.so.1
libcuda.so.1 (libc6,AArch64) => /usr/lib/aarch64-linux-gnu/libcuda.so.1
libcuda.so.1 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libcuda.so.1
libcuda.so (libc6,AArch64) => /usr/local/cuda-13.0/compat/libcuda.so
libcublasLt.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcublasLt.so.13
libcublas.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcublas.so.13
root@nvl72122-T18:/vllm-workspace# python3 -c 'import torch; torch.cuda.is_available()'
```

However, this PR works for me on GH200 but NOT GB200.

GH200:

```
root@604913cbe23b:/vllm-workspace# ldconfig -p | grep cuda
libnvrtc.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libnvrtc.so.13
libnvrtc-builtins.so.13.0 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libnvrtc-builtins.so.13.0
libnvblas.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libnvblas.so.13
libicudata.so.70 (libc6,AArch64) => /usr/lib/aarch64-linux-gnu/libicudata.so.70
libcusparse.so.12 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcusparse.so.12
libcusolverMg.so.12 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcusolverMg.so.12
libcusolver.so.12 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcusolver.so.12
libcurand.so.10 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcurand.so.10
libcurand.so (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcurand.so
libcudart.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcudart.so.13
libcudart.so (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcudart.so
libcudadebugger.so.1 (libc6,AArch64) => /usr/lib/aarch64-linux-gnu/libcudadebugger.so.1
libcuda.so.1 (libc6,AArch64) => /usr/lib/aarch64-linux-gnu/libcuda.so.1
libcublasLt.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcublasLt.so.13
libcublas.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcublas.so.13
root@604913cbe23b:/vllm-workspace# python3 -c 'import torch; torch.cuda.is_available()'
```

GB200:

```
root@nvl72122-T18:/vllm-workspace# ldconfig -p | grep cuda
libnvrtc.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libnvrtc.so.13
libnvrtc-builtins.so.13.0 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libnvrtc-builtins.so.13.0
libnvblas.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libnvblas.so.13
libicudata.so.70 (libc6,AArch64) => /usr/lib/aarch64-linux-gnu/libicudata.so.70
libcusparse.so.12 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcusparse.so.12
libcusolverMg.so.12 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcusolverMg.so.12
libcusolver.so.12 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcusolver.so.12
libcurand.so.10 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcurand.so.10
libcurand.so (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcurand.so
libcudart.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcudart.so.13
libcudart.so (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcudart.so
libcudadebugger.so.1 (libc6,AArch64) => /usr/lib/aarch64-linux-gnu/libcudadebugger.so.1
libcuda.so.1 (libc6,AArch64) => /usr/lib/aarch64-linux-gnu/libcuda.so.1
libcublasLt.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcublasLt.so.13
libcublas.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcublas.so.13
root@nvl72122-T18:/vllm-workspace# python3 -c 'import torch; torch.cuda.is_available()'
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:182: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:119.)
return torch._C._cuda_getDeviceCount() > 0
```

I can only assume the difference between what @huydhn saw vs. what I saw on …
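One caveat when reading the dumps above: `ldconfig -p` only lists entries from the `ld.so` cache, so libraries reachable solely through `LD_LIBRARY_PATH` (the strategy in this PR) will never show up in it. A loader-level check is more telling, e.g.:

```bash
# ldconfig -p reads /etc/ld.so.cache and ignores LD_LIBRARY_PATH, so a compat
# libcuda exposed only via the environment is invisible to it.
echo "$LD_LIBRARY_PATH"
# Ask the dynamic loader directly whether libcuda.so.1 resolves:
python3 -c "import ctypes; ctypes.CDLL('libcuda.so.1'); print('libcuda.so.1 resolvable')"
```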
Once the Docker image from this PR is ready, I can try to run a round of benchmarks with it and report back.
@huydhn I uploaded the two images to: …
Actually, on one of the internal GB300 dev machines we have (which is different from the one I was using previously), I am seeing this problem appear as well:

```
root@a98df7abfc4e:/vllm-workspace# ldconfig -p | grep cuda
libnvrtc.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libnvrtc.so.13
libnvrtc-builtins.so.13.0 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libnvrtc-builtins.so.13.0
libnvidia-ptxjitcompiler.so.1 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libnvidia-ptxjitcompiler.so.1
libnvidia-nvvm70.so.4 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libnvidia-nvvm70.so.4
libnvidia-nvvm.so.4 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libnvidia-nvvm.so.4
libnvidia-nvvm.so (libc6,AArch64) => /usr/local/cuda-13.0/compat/libnvidia-nvvm.so
libnvidia-gpucomp.so.580.82.07 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libnvidia-gpucomp.so.580.82.07
libnvblas.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libnvblas.so.13
libicudata.so.70 (libc6,AArch64) => /lib/aarch64-linux-gnu/libicudata.so.70
libcusparse.so.12 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcusparse.so.12
libcusolverMg.so.12 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcusolverMg.so.12
libcusolver.so.12 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcusolver.so.12
libcurand.so.10 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcurand.so.10
libcurand.so (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcurand.so
libcudart.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcudart.so.13
libcudart.so (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcudart.so
libcudadebugger.so.1 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libcudadebugger.so.1
libcudadebugger.so.1 (libc6,AArch64) => /usr/lib/libcudadebugger.so.1
libcuda.so.1 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libcuda.so.1
libcuda.so.1 (libc6,AArch64) => /usr/lib/libcuda.so.1
libcuda.so (libc6,AArch64) => /usr/local/cuda-13.0/compat/libcuda.so
libcuda.so (libc6,AArch64) => /usr/lib/libcuda.so
libcublasLt.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcublasLt.so.13
libcublas.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcublas.so.13
root@a98df7abfc4e:/vllm-workspace# python3 -c 'import torch; torch.cuda.is_available()'
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:182: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:119.)
return torch._C._cuda_getDeviceCount() > 0
root@a98df7abfc4e:/vllm-workspace# nvidia-smi
Sat Jan 24 13:21:24 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GB300 On | 00000008:06:00.0 Off | 0 |
| N/A 26C P0 172W / 1400W | 0MiB / 284208MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GB300 On | 00000009:06:00.0 Off | 0 |
| N/A 25C P0 172W / 1400W | 0MiB / 284208MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GB300 On | 00000018:06:00.0 Off | 0 |
| N/A 25C P0 168W / 1400W | 0MiB / 284208MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA GB300 On | 00000019:06:00.0 Off | 0 |
| N/A 25C P0 172W / 1400W | 0MiB / 284208MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
|  No running processes found                                                            |
+-----------------------------------------------------------------------------------------+
```

@emricksini-h could you change this PR to simply revert your previous PR? And let's just get it merged to prevent further breakage for now.
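(Aside for anyone triaging the error 803 above: per the warning text, it means the loaded userspace `libcuda` and the kernel driver disagree — typically because a compat library shadows a newer host driver. Some quick checks, with illustrative paths:)

```bash
# Kernel driver version as reported by the host:
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Userspace compat libraries baked into the image (path illustrative):
ls /usr/local/cuda-13.0/compat/
# Copies of libcuda.so.1 registered in the ld.so cache, in lookup order:
ldconfig -p | grep 'libcuda\.so\.1'
```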
Closing this in favor of #33116. It resolves the same initial issue but implements a better approach by lowering the priority of the CUDA compatibility libs. This allows the system to prioritize newer CUDA drivers when available, avoiding the initialization errors we saw here.
A change in #30784 was introduced to ensure CUDA compatibility libraries remained available even after an `ldconfig` cache update (which happens when `apt-get` is called, for example). However, this caused a regression on systems where compatibility libraries are not required (when the host driver is newer than the container's CUDA version), as reported in #32373.

This PR reverts the change to `/etc/ld.so.conf.d` and implements a safer strategy by appending the compatibility path to `LD_LIBRARY_PATH`. This ensures that compatibility libraries persist after an `ldconfig` cache reset but are only used as a fallback.

Fixes #32373
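In shell terms, the two strategies compare roughly as follows; the file and directory names are illustrative rather than the Dockerfile's exact contents:

```bash
# Reverted approach: register the compat dir with the dynamic linker config,
# so it survives ldconfig cache rebuilds (e.g., those triggered by apt-get):
#   echo "/usr/local/cuda/compat" > /etc/ld.so.conf.d/cuda-compat.conf && ldconfig

# This PR's approach: leave the linker config untouched and expose the compat
# dir through the environment instead, so the cached host driver entries are
# unaffected on systems that do not need the compat libraries:
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}/usr/local/cuda/compat"
```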