[build] fix priority of cuda-compat libraries in ld loading #34226
youkaichao wants to merge 1 commit into vllm-project:main
Conversation
Signed-off-by: youkaichao <youkaichao@gmail.com>
Code Review
This pull request correctly fixes an issue with CUDA library loading priority in Docker by renaming the configuration file to ensure it's loaded last. The change is applied in two necessary locations within the Dockerfile. My review highlights a significant opportunity to improve the maintainability of the Dockerfile by addressing widespread code duplication related to CUDA version string manipulation. I've suggested a refactoring to define and reuse environment variables for different CUDA version formats, which aligns with existing patterns in the file and would make it more robust.
```diff
-# Ensure CUDA compatibility library is loaded
-RUN echo "/usr/local/cuda-$(echo "$CUDA_VERSION" | cut -d. -f1,2)/compat/" > /etc/ld.so.conf.d/cuda-compat.conf && ldconfig
+# Ensure CUDA compatibility library is loaded at last to avoid overriding the system libraries
+RUN echo "/usr/local/cuda-$(echo "$CUDA_VERSION" | cut -d. -f1,2)/compat/" > /etc/ld.so.conf.d/zzz-cuda-compat.conf && ldconfig
```
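The rename works because ldconfig expands the `/etc/ld.so.conf.d/*.conf` glob in sorted order, so the filename decides search priority. A quick illustration of the ordering (filenames only, not the actual Dockerfile logic; `nvidia.conf` is the example name mentioned later in this thread):

```shell
# With the old name, cuda-compat.conf sorts before nvidia.conf, so the
# compat directory is searched first; the zzz- prefix pushes it last.
printf '%s\n' nvidia.conf zzz-cuda-compat.conf cuda-compat.conf | LC_ALL=C sort
# cuda-compat.conf
# nvidia.conf
# zzz-cuda-compat.conf
```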
While this change is correct, the command substitution $(echo "$CUDA_VERSION" | cut -d. -f1,2) to extract the major and minor CUDA version is repeated over 20 times in this Dockerfile. This makes the file hard to maintain and prone to errors if the logic needs to be updated.
To improve maintainability, consider defining variables for the different CUDA version formats at the beginning of the build stage and reusing them. A pattern for this already exists in this file for PYTHON_VERSION_STR (lines 499-500).
You could add a RUN command in the base stage (e.g., after line 121) to define and export these variables:
```dockerfile
RUN echo "export CUDA_VERSION_SHORT=$(echo $CUDA_VERSION | cut -d. -f1,2)" >> /etc/environment && \
    echo "export CUDA_VERSION_NODOT=$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.')" >> /etc/environment && \
    echo "export CUDA_VERSION_DASH=$(echo $CUDA_VERSION | cut -d. -f1,2 | tr '.' '-')" >> /etc/environment
```

Then, you could source this file in subsequent RUN commands and use the variables. For example, this line would become:

```dockerfile
RUN . /etc/environment && echo "/usr/local/cuda-${CUDA_VERSION_SHORT}/compat/" > /etc/ld.so.conf.d/zzz-cuda-compat.conf && ldconfig
```

Applying this pattern throughout the Dockerfile would significantly reduce redundancy and improve readability. A similar change would be needed in the vllm-base stage.
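For reference, the three suggested variables would expand as follows (the `CUDA_VERSION` value here is just an example; the real value comes from the NVIDIA base image):

```shell
CUDA_VERSION=12.8.1
echo "$CUDA_VERSION" | cut -d. -f1,2              # CUDA_VERSION_SHORT -> 12.8
echo "$CUDA_VERSION" | cut -d. -f1,2 | tr -d '.'  # CUDA_VERSION_NODOT -> 128
echo "$CUDA_VERSION" | cut -d. -f1,2 | tr '.' '-' # CUDA_VERSION_DASH  -> 12-8
```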
```diff
-# Ensure CUDA compatibility library is loaded
-RUN echo "/usr/local/cuda-$(echo "$CUDA_VERSION" | cut -d. -f1,2)/compat/" > /etc/ld.so.conf.d/cuda-compat.conf && ldconfig
+# Ensure CUDA compatibility library is loaded at last to avoid overriding the system libraries
+RUN echo "/usr/local/cuda-$(echo "$CUDA_VERSION" | cut -d. -f1,2)/compat/" > /etc/ld.so.conf.d/zzz-cuda-compat.conf && ldconfig
```
Similar to the comment on line 136, this is another instance of repeated logic for CUDA version string manipulation. Applying the suggested refactoring in this vllm-base stage as well would improve maintainability. You can add a RUN command to define and export CUDA_VERSION_SHORT, CUDA_VERSION_NODOT, and CUDA_VERSION_DASH after Python installation (e.g., after line 537) and reuse them in subsequent commands.
hmmm after a second thought it might break old drivers.
This PR is ineffective. It just reverts everything to the state before #30784, where CUDA forward compatibility is completely disabled. 1x NVIDIA A100 with NVIDIA 575.57.08 + CUDA 13.0:
The persistent cuda-compat.conf in /etc/ld.so.conf.d/ causes Error 803 on consumer NVIDIA GPUs (GeForce, RTX) when the host driver version is newer than the container's CUDA toolkit. CUDA forward compatibility is only supported on datacenter/professional GPUs. Replace the unconditional ldconfig registration with an opt-in mechanism:

- VLLM_ENABLE_CUDA_COMPATIBILITY=1 enables compat library loading
- VLLM_CUDA_COMPATIBILITY_PATH overrides the default compat path
- Runtime LD_LIBRARY_PATH is set before torch import in env_override.py
- Default is disabled (0) so consumer GPU users are unaffected

This fixes the regression introduced by the persistent cuda-compat.conf that broke systems with NVIDIA driver 580.x (CUDA 13.0 compatible).

Fixes: vllm-project#32373
Related: vllm-project#33992, vllm-project#34226

Signed-off-by: Andrew Mello <andrew@88plug.com>
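A minimal shell sketch of the opt-in behavior described in this commit message, using the variable names it proposes (VLLM_ENABLE_CUDA_COMPATIBILITY, VLLM_CUDA_COMPATIBILITY_PATH); the actual change lives in env_override.py before torch is imported, so this is only an illustrative equivalent, and the default compat path is an assumption:

```shell
# Default is disabled, so consumer GPU users are unaffected.
VLLM_ENABLE_CUDA_COMPATIBILITY="${VLLM_ENABLE_CUDA_COMPATIBILITY:-0}"
LD_LIBRARY_PATH=""
if [ "$VLLM_ENABLE_CUDA_COMPATIBILITY" = "1" ]; then
    # VLLM_CUDA_COMPATIBILITY_PATH overrides the default compat path.
    compat="${VLLM_CUDA_COMPATIBILITY_PATH:-/usr/local/cuda/compat}"
    LD_LIBRARY_PATH="${compat}${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
fi
echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-<empty>}"
```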
Superseded by #33992
Purpose
People still report `Error 803: system has unsupported display driver / cuda driver combination` after #33116. It is because some docker images specify the driver .so file in nvidia.conf, which gets suppressed by cuda-compat.conf. This PR adds a `zzz-` prefix to cuda-compat.conf so that it is ordered last.

Test Plan
Test some simple code:

```shell
python3 -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.device_count())"
```

Test Result
It passes.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.