
[Docker][Hotfix] CUDA compatibility enablement #32474

Closed
emricksini-h wants to merge 5 commits into vllm-project:main from emricksini-h:hotfix/cuda-compat

Conversation

@emricksini-h
Contributor

@emricksini-h emricksini-h commented Jan 16, 2026

A change in #30784 was introduced to ensure CUDA compatibility libraries remained available even after an ldconfig cache update (which happens when apt-get is called for example). However, this caused a regression on systems where compatibility libraries are not required (when the host driver is newer than the container's CUDA version), as reported in #32373.

This PR reverts the change to /etc/ld.so.conf.d and implements a safer strategy by appending the compatibility path to LD_LIBRARY_PATH. This ensures that compatibility libraries persist after an ldconfig cache reset but are only used as a fallback.
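
For illustration, a minimal sketch of the approach described above, assuming the compat libraries live under /usr/local/cuda/compat (the exact path and the surrounding Dockerfile context in this PR may differ):

# Sketch only -- not the literal diff from this PR.
# The compat directory is appended to LD_LIBRARY_PATH instead of being
# registered in /etc/ld.so.conf.d, so the setting survives ldconfig cache
# refreshes triggered by apt-get without permanently altering the system
# linker configuration.
ENV LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda/compat

If LD_LIBRARY_PATH happens to be unset in the base image, this form produces a leading colon; guarding against that case is omitted here for brevity.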

Fixes #32373

@emricksini-h emricksini-h changed the title from "[Hotfix] CUDA compat" to "[Docker][Hotfix] CUDA compatibility enablement" on Jan 16, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request is a hotfix to address a regression related to CUDA compatibility libraries in the Docker build. The changes revert a persistent modification to the system's dynamic linker configuration, opting for a safer strategy of appending the compatibility library path to the LD_LIBRARY_PATH environment variable. This ensures the libraries are available as a fallback without causing issues on systems where they are not needed. My review identifies a potential issue in the Dockerfile where an environment variable is not quoted, which could lead to build failures under certain conditions. I've provided suggestions to improve the robustness of the script.
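
To illustrate the quoting concern (the variable name below is a placeholder, not necessarily the one flagged in the diff):

# Unquoted: if CUDA_COMPAT_DIR is empty the test silently evaluates to true,
# and if it ever contains spaces the build step errors out.
RUN if [ -d $CUDA_COMPAT_DIR ]; then ldconfig; fi

# Quoted: the expansion stays a single word even when the variable is empty.
RUN if [ -d "$CUDA_COMPAT_DIR" ]; then ldconfig; fi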

@emricksini-h
Contributor Author

cc @wangshangsam & @huydhn.
Let me know if you would prefer to do this in two PRs (yours, #32377, plus one adding the LD_LIBRARY_PATH change), or keep this single PR doing both.

…reset on `apt-get` (vllm-project#30784)"

This reverts commit 2a60ac9.

Signed-off-by: emricksini-h <emrick.birivoutin@hcompany.ai>
Signed-off-by: emricksini-h <emrick.birivoutin@hcompany.ai>
Signed-off-by: emricksini-h <emrick.birivoutin@hcompany.ai>
@wangshangsam
Collaborator

@emricksini-h did you test this change, and did it work for you?

I'll test this branch on CUDA 13. If this works, let's merge this PR.

Could you also add Fixes #32373 in your PR description?

@emricksini-h
Contributor Author

Yes, I've tested it on my side and it's working as expected.

@wangshangsam
Collaborator

wangshangsam commented Jan 17, 2026

Hmmm ... weirdly enough ... the current main (specifically, 8e61425ee6d0bd03d3669c148eba8b263d101273) works for me on both GH200 and GB200 just fine:
GH200:

root@0234b6d6eac5:/vllm-workspace# ldconfig -p | grep cuda
	libnvrtc.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libnvrtc.so.13
	libnvrtc-builtins.so.13.0 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libnvrtc-builtins.so.13.0
	libnvidia-ptxjitcompiler.so.1 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libnvidia-ptxjitcompiler.so.1
	libnvidia-nvvm70.so.4 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libnvidia-nvvm70.so.4
	libnvidia-nvvm.so.4 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libnvidia-nvvm.so.4
	libnvidia-nvvm.so (libc6,AArch64) => /usr/local/cuda-13.0/compat/libnvidia-nvvm.so
	libnvidia-gpucomp.so.580.82.07 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libnvidia-gpucomp.so.580.82.07
	libnvblas.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libnvblas.so.13
	libicudata.so.70 (libc6,AArch64) => /usr/lib/aarch64-linux-gnu/libicudata.so.70
	libcusparse.so.12 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcusparse.so.12
	libcusolverMg.so.12 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcusolverMg.so.12
	libcusolver.so.12 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcusolver.so.12
	libcurand.so.10 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcurand.so.10
	libcurand.so (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcurand.so
	libcudart.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcudart.so.13
	libcudart.so (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcudart.so
	libcudadebugger.so.1 (libc6,AArch64) => /usr/lib/aarch64-linux-gnu/libcudadebugger.so.1
	libcudadebugger.so.1 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libcudadebugger.so.1
	libcuda.so.1 (libc6,AArch64) => /usr/lib/aarch64-linux-gnu/libcuda.so.1
	libcuda.so.1 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libcuda.so.1
	libcuda.so (libc6,AArch64) => /usr/local/cuda-13.0/compat/libcuda.so
	libcublasLt.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcublasLt.so.13
	libcublas.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcublas.so.13
root@0234b6d6eac5:/vllm-workspace# python3 -c 'import torch; torch.cuda.is_available()'

GB200:

root@nvl72122-T18:/vllm-workspace# ldconfig -p | grep cuda
	libnvrtc.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libnvrtc.so.13
	libnvrtc-builtins.so.13.0 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libnvrtc-builtins.so.13.0
	libnvidia-ptxjitcompiler.so.1 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libnvidia-ptxjitcompiler.so.1
	libnvidia-nvvm70.so.4 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libnvidia-nvvm70.so.4
	libnvidia-nvvm.so.4 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libnvidia-nvvm.so.4
	libnvidia-nvvm.so (libc6,AArch64) => /usr/local/cuda-13.0/compat/libnvidia-nvvm.so
	libnvidia-gpucomp.so.580.82.07 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libnvidia-gpucomp.so.580.82.07
	libnvblas.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libnvblas.so.13
	libicudata.so.70 (libc6,AArch64) => /usr/lib/aarch64-linux-gnu/libicudata.so.70
	libcusparse.so.12 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcusparse.so.12
	libcusolverMg.so.12 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcusolverMg.so.12
	libcusolver.so.12 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcusolver.so.12
	libcurand.so.10 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcurand.so.10
	libcurand.so (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcurand.so
	libcudart.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcudart.so.13
	libcudart.so (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcudart.so
	libcudadebugger.so.1 (libc6,AArch64) => /usr/lib/aarch64-linux-gnu/libcudadebugger.so.1
	libcudadebugger.so.1 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libcudadebugger.so.1
	libcuda.so.1 (libc6,AArch64) => /usr/lib/aarch64-linux-gnu/libcuda.so.1
	libcuda.so.1 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libcuda.so.1
	libcuda.so (libc6,AArch64) => /usr/local/cuda-13.0/compat/libcuda.so
	libcublasLt.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcublasLt.so.13
	libcublas.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcublas.so.13
root@nvl72122-T18:/vllm-workspace# python3 -c 'import torch; torch.cuda.is_available()'

However, this PR works for me on GH200 but NOT GB200:
GH200:

root@604913cbe23b:/vllm-workspace# ldconfig -p | grep cuda
	libnvrtc.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libnvrtc.so.13
	libnvrtc-builtins.so.13.0 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libnvrtc-builtins.so.13.0
	libnvblas.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libnvblas.so.13
	libicudata.so.70 (libc6,AArch64) => /usr/lib/aarch64-linux-gnu/libicudata.so.70
	libcusparse.so.12 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcusparse.so.12
	libcusolverMg.so.12 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcusolverMg.so.12
	libcusolver.so.12 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcusolver.so.12
	libcurand.so.10 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcurand.so.10
	libcurand.so (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcurand.so
	libcudart.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcudart.so.13
	libcudart.so (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcudart.so
	libcudadebugger.so.1 (libc6,AArch64) => /usr/lib/aarch64-linux-gnu/libcudadebugger.so.1
	libcuda.so.1 (libc6,AArch64) => /usr/lib/aarch64-linux-gnu/libcuda.so.1
	libcublasLt.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcublasLt.so.13
	libcublas.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcublas.so.13
root@604913cbe23b:/vllm-workspace# python3 -c 'import torch; torch.cuda.is_available()'

GB200:

root@nvl72122-T18:/vllm-workspace# ldconfig -p | grep cuda
	libnvrtc.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libnvrtc.so.13
	libnvrtc-builtins.so.13.0 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libnvrtc-builtins.so.13.0
	libnvblas.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libnvblas.so.13
	libicudata.so.70 (libc6,AArch64) => /usr/lib/aarch64-linux-gnu/libicudata.so.70
	libcusparse.so.12 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcusparse.so.12
	libcusolverMg.so.12 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcusolverMg.so.12
	libcusolver.so.12 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcusolver.so.12
	libcurand.so.10 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcurand.so.10
	libcurand.so (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcurand.so
	libcudart.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcudart.so.13
	libcudart.so (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcudart.so
	libcudadebugger.so.1 (libc6,AArch64) => /usr/lib/aarch64-linux-gnu/libcudadebugger.so.1
	libcuda.so.1 (libc6,AArch64) => /usr/lib/aarch64-linux-gnu/libcuda.so.1
	libcublasLt.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcublasLt.so.13
	libcublas.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcublas.so.13
root@nvl72122-T18:/vllm-workspace# python3 -c 'import torch; torch.cuda.is_available()'
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:182: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:119.)
  return torch._C._cuda_getDeviceCount() > 0

I can only assume the difference between what @huydhn saw vs. what I saw on main is caused by some environment differences between the machines we are using, but I thought the shared libraries were entirely contained within the image. @huydhn, were you building another image on top of the base vLLM image, one that installs CUDA 12.9 or something?
But yeah, as far as this PR is concerned, I don't think we can merge it as-is.
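
For anyone reproducing this, a few generic checks (not commands taken from the thread) that help compare what the two machines resolve:

# Which libcuda.so.1 entries the loader cache knows about:
ldconfig -p | grep 'libcuda\.so\.1'
# Whether the compat directory is on LD_LIBRARY_PATH in the current shell:
echo "$LD_LIBRARY_PATH"
# CUDA toolkit version torch was built against, and whether CUDA initializes:
python3 -c 'import torch; print(torch.version.cuda, torch.cuda.is_available())'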

@emricksini-h emricksini-h requested a review from huydhn January 19, 2026 13:42
@huydhn
Contributor

huydhn commented Jan 19, 2026

Once the Docker image from this PR is ready, I can run a round of benchmarks with it and report back.


@wangshangsam
Collaborator

wangshangsam commented Jan 24, 2026

Actually, on one of our internal GB300 dev machines (a different one from the machine I was using previously), I am seeing this problem on main as well, so I guess this is indeed machine-dependent.

root@a98df7abfc4e:/vllm-workspace# ldconfig -p | grep cuda
        libnvrtc.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libnvrtc.so.13                                                                                                                                                     
        libnvrtc-builtins.so.13.0 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libnvrtc-builtins.so.13.0                                                                                                                               
        libnvidia-ptxjitcompiler.so.1 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libnvidia-ptxjitcompiler.so.1
        libnvidia-nvvm70.so.4 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libnvidia-nvvm70.so.4
        libnvidia-nvvm.so.4 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libnvidia-nvvm.so.4
        libnvidia-nvvm.so (libc6,AArch64) => /usr/local/cuda-13.0/compat/libnvidia-nvvm.so
        libnvidia-gpucomp.so.580.82.07 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libnvidia-gpucomp.so.580.82.07
        libnvblas.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libnvblas.so.13
        libicudata.so.70 (libc6,AArch64) => /lib/aarch64-linux-gnu/libicudata.so.70
        libcusparse.so.12 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcusparse.so.12
        libcusolverMg.so.12 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcusolverMg.so.12
        libcusolver.so.12 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcusolver.so.12
        libcurand.so.10 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcurand.so.10
        libcurand.so (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcurand.so
        libcudart.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcudart.so.13
        libcudart.so (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcudart.so
        libcudadebugger.so.1 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libcudadebugger.so.1
        libcudadebugger.so.1 (libc6,AArch64) => /usr/lib/libcudadebugger.so.1
        libcuda.so.1 (libc6,AArch64) => /usr/local/cuda-13.0/compat/libcuda.so.1
        libcuda.so.1 (libc6,AArch64) => /usr/lib/libcuda.so.1
        libcuda.so (libc6,AArch64) => /usr/local/cuda-13.0/compat/libcuda.so
        libcuda.so (libc6,AArch64) => /usr/lib/libcuda.so
        libcublasLt.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcublasLt.so.13
        libcublas.so.13 (libc6,AArch64) => /usr/local/cuda/targets/sbsa-linux/lib/libcublas.so.13
root@a98df7abfc4e:/vllm-workspace# python3 -c 'import torch; torch.cuda.is_available()'
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:182: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:119.)
  return torch._C._cuda_getDeviceCount() > 0
root@a98df7abfc4e:/vllm-workspace# nvidia-smi 
Sat Jan 24 13:21:24 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB300                   On  |   00000008:06:00.0 Off |                    0 |
| N/A   26C    P0            172W / 1400W |       0MiB / 284208MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GB300                   On  |   00000009:06:00.0 Off |                    0 |
| N/A   25C    P0            172W / 1400W |       0MiB / 284208MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GB300                   On  |   00000018:06:00.0 Off |                    0 |
| N/A   25C    P0            168W / 1400W |       0MiB / 284208MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GB300                   On  |   00000019:06:00.0 Off |                    0 |
| N/A   25C    P0            172W / 1400W |       0MiB / 284208MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |

@emricksini-h could you change this PR to simply revert your previous PR? Let's get it merged to prevent further breakage for now.

@emricksini-h
Contributor Author

Closing this in favor of #33116. It resolves the same initial issue but implements a better approach by lowering the priority of the CUDA compatibility libs. This allows the system to prioritize newer CUDA drivers when available, avoiding the initialization errors we saw here.



Development

Successfully merging this pull request may close these issues.

[Bug]: Fail to load vLLM on new NVIDIA driver
