
[build] fix priority of cuda-compat libraries in ld loading#34226

Closed
youkaichao wants to merge 1 commit into vllm-project:main from youkaichao:fix_mismatch

Conversation

@youkaichao
Member

@youkaichao youkaichao commented Feb 10, 2026

Purpose

People still report Error 803: system has unsupported display driver / cuda driver combination after #33116. This happens because some Docker images specify the driver .so file in nvidia.conf, which gets overridden by cuda-compat.conf.

This PR adds a zzz- prefix to cuda-compat.conf so that it is ordered last.
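For context (editorial note, not part of the PR): ldconfig expands the `include /etc/ld.so.conf.d/*.conf` directive with glob(3), which returns matches in sorted order, so a `zzz-` prefix makes that file the last one read. A minimal sketch of the ordering, using throwaway files with the same names:

```shell
# Sketch: glob(3) sorts *.conf matches alphabetically, so a file prefixed
# with zzz- is read after the others. Demonstrated in a temp directory.
demo=$(mktemp -d)
touch "$demo/cuda-compat.conf" "$demo/nvidia.conf" "$demo/zzz-cuda-compat.conf"
ls "$demo"/*.conf   # zzz-cuda-compat.conf sorts after the other two
rm -rf "$demo"
```

Whether reading the compat path last actually demotes its priority in the resolver cache is a separate question, discussed further down the thread.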


Test Plan

Run a simple CUDA availability check:

python3 -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.device_count())"

Test Result

The check passes.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: youkaichao <youkaichao@gmail.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly fixes an issue with CUDA library loading priority in Docker by renaming the configuration file to ensure it's loaded last. The change is applied in two necessary locations within the Dockerfile. My review highlights a significant opportunity to improve the maintainability of the Dockerfile by addressing widespread code duplication related to CUDA version string manipulation. I've suggested a refactoring to define and reuse environment variables for different CUDA version formats, which aligns with existing patterns in the file and would make it more robust.

# Ensure CUDA compatibility library is loaded
RUN echo "/usr/local/cuda-$(echo "$CUDA_VERSION" | cut -d. -f1,2)/compat/" > /etc/ld.so.conf.d/cuda-compat.conf && ldconfig
# Ensure CUDA compatibility library is loaded at last to avoid overriding the system libraries
RUN echo "/usr/local/cuda-$(echo "$CUDA_VERSION" | cut -d. -f1,2)/compat/" > /etc/ld.so.conf.d/zzz-cuda-compat.conf && ldconfig
Contributor


Severity: high

While this change is correct, the command substitution $(echo "$CUDA_VERSION" | cut -d. -f1,2) to extract the major and minor CUDA version is repeated over 20 times in this Dockerfile. This makes the file hard to maintain and prone to errors if the logic needs to be updated.

To improve maintainability, consider defining variables for the different CUDA version formats at the beginning of the build stage and reusing them. A pattern for this already exists in this file for PYTHON_VERSION_STR (lines 499-500).

You could add a RUN command in the base stage (e.g., after line 121) to define and export these variables:

RUN echo "export CUDA_VERSION_SHORT=$(echo $CUDA_VERSION | cut -d. -f1,2)" >> /etc/environment && \
    echo "export CUDA_VERSION_NODOT=$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.')" >> /etc/environment && \
    echo "export CUDA_VERSION_DASH=$(echo $CUDA_VERSION | cut -d. -f1,2 | tr '.' '-')" >> /etc/environment

Then, you could source this file in subsequent RUN commands and use the variables. For example, this line would become:

RUN . /etc/environment && echo "/usr/local/cuda-${CUDA_VERSION_SHORT}/compat/" > /etc/ld.so.conf.d/zzz-cuda-compat.conf && ldconfig

Applying this pattern throughout the Dockerfile would significantly reduce redundancy and improve readability. A similar change would be needed in the vllm-base stage.

# Ensure CUDA compatibility library is loaded
RUN echo "/usr/local/cuda-$(echo "$CUDA_VERSION" | cut -d. -f1,2)/compat/" > /etc/ld.so.conf.d/cuda-compat.conf && ldconfig
# Ensure CUDA compatibility library is loaded at last to avoid overriding the system libraries
RUN echo "/usr/local/cuda-$(echo "$CUDA_VERSION" | cut -d. -f1,2)/compat/" > /etc/ld.so.conf.d/zzz-cuda-compat.conf && ldconfig
Contributor


Severity: high

Similar to the comment on line 136, this is another instance of repeated logic for CUDA version string manipulation. Applying the suggested refactoring in this vllm-base stage as well would improve maintainability. You can add a RUN command to define and export CUDA_VERSION_SHORT, CUDA_VERSION_NODOT, and CUDA_VERSION_DASH after Python installation (e.g., after line 537) and reuse them in subsequent commands.

@youkaichao
Member Author

Hmm, on second thought, this might break old drivers.

@ehfd
Contributor

ehfd commented Feb 15, 2026

This PR is ineffective. It just reverts everything to before #30784, where CUDA forward compatibility is completely disabled.

1x NVIDIA A100 with NVIDIA 575.57.08 + CUDA 13.0:

$ mv /etc/ld.so.conf.d/cuda-compat.conf /etc/ld.so.conf.d/zzz-cuda-compat.conf
$ ldconfig
$ ldconfig -p | grep libcuda
        libcudart.so.13 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.13
        libcudart.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so
        libcudadebugger.so.1 (libc6,x86-64) => /usr/local/cuda-13.0/compat/libcudadebugger.so.1
        libcudadebugger.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcudadebugger.so.1
        libcuda.so.1 (libc6,x86-64) => /usr/local/cuda-13.0/compat/libcuda.so.1
        libcuda.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so.1
        libcuda.so (libc6,x86-64) => /usr/local/cuda-13.0/compat/libcuda.so
        libcuda.so (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so
$ python3 -m vllm.entrypoints.openai.api_server --port "5000" --host 0.0.0.0 --download-dir /workspace/.cache/huggingface/hub --model Qwen/Qwen3-0.6B --served-model-name qwen3 --reasoning-parser deepseek_r1 --tensor-parallel-size "1" --trust-remote-code
RuntimeError: The NVIDIA driver on your system is too old (found version 12090). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver.
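Note what the `ldconfig -p` output above shows: even after the rename, the compat entries are still listed before the system ones, and ld.so resolves the first matching entry for a given soname, which is why the rename alone did not change behavior. A hedged way to inspect every cached location of a library and see which one wins (using `libc.so.6` as a stand-in because it exists on any glibc system; substitute `libcuda.so.1` inside a CUDA container):

```shell
# For a given soname, the first line ldconfig -p prints is the path the
# dynamic linker will actually resolve; later lines are shadowed duplicates.
ldconfig -p | grep 'libc\.so\.6'
```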

88plug added a commit to 88plug/vllm that referenced this pull request Feb 18, 2026
The persistent cuda-compat.conf in /etc/ld.so.conf.d/ causes Error 803
on consumer NVIDIA GPUs (GeForce, RTX) when the host driver version is
newer than the container's CUDA toolkit. CUDA forward compatibility is
only supported on datacenter/professional GPUs.

Replace the unconditional ldconfig registration with an opt-in mechanism:
- VLLM_ENABLE_CUDA_COMPATIBILITY=1 enables compat library loading
- VLLM_CUDA_COMPATIBILITY_PATH overrides the default compat path
- Runtime LD_LIBRARY_PATH is set before torch import in env_override.py
- Default is disabled (0) so consumer GPU users are unaffected

This fixes the regression introduced by the persistent cuda-compat.conf
that broke systems with NVIDIA driver 580.x (CUDA 13.0 compatible).

Fixes: vllm-project#32373
Related: vllm-project#33992, vllm-project#34226
Signed-off-by: Andrew Mello <andrew@88plug.com>
@github-project-automation github-project-automation bot moved this to Done in NVIDIA Feb 26, 2026
@ehfd
Contributor

ehfd commented Feb 26, 2026

Superseded by #33992
