[build] fix priority of cuda-compat libraries in ld loading #34226
youkaichao wants to merge 1 commit into vllm-project:main
Conversation
Signed-off-by: youkaichao <youkaichao@gmail.com>
Code Review
This pull request correctly fixes an issue with CUDA library loading priority in Docker by renaming the configuration file to ensure it's loaded last. The change is applied in two necessary locations within the Dockerfile. My review highlights a significant opportunity to improve the maintainability of the Dockerfile by addressing widespread code duplication related to CUDA version string manipulation. I've suggested a refactoring to define and reuse environment variables for different CUDA version formats, which aligns with existing patterns in the file and would make it more robust.
```diff
-# Ensure CUDA compatibility library is loaded
-RUN echo "/usr/local/cuda-$(echo "$CUDA_VERSION" | cut -d. -f1,2)/compat/" > /etc/ld.so.conf.d/cuda-compat.conf && ldconfig
+# Ensure CUDA compatibility library is loaded at last to avoid overriding the system libraries
+RUN echo "/usr/local/cuda-$(echo "$CUDA_VERSION" | cut -d. -f1,2)/compat/" > /etc/ld.so.conf.d/zzz-cuda-compat.conf && ldconfig
```
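The rename works because ldconfig expands the `/etc/ld.so.conf.d/*.conf` glob in sorted order, so the filename decides search priority. A quick illustration of the ordering (filenames only, not the actual Dockerfile logic; `nvidia.conf` is the example name mentioned later in this thread):

```shell
# With the old name, cuda-compat.conf sorts before nvidia.conf, so the
# compat directory is searched first; the zzz- prefix pushes it last.
printf '%s\n' nvidia.conf zzz-cuda-compat.conf cuda-compat.conf | LC_ALL=C sort
# cuda-compat.conf
# nvidia.conf
# zzz-cuda-compat.conf
```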
While this change is correct, the command substitution $(echo "$CUDA_VERSION" | cut -d. -f1,2) to extract the major and minor CUDA version is repeated over 20 times in this Dockerfile. This makes the file hard to maintain and prone to errors if the logic needs to be updated.
To improve maintainability, consider defining variables for the different CUDA version formats at the beginning of the build stage and reusing them. A pattern for this already exists in this file for PYTHON_VERSION_STR (lines 499-500).
You could add a RUN command in the base stage (e.g., after line 121) to define and export these variables:
```dockerfile
RUN echo "export CUDA_VERSION_SHORT=$(echo $CUDA_VERSION | cut -d. -f1,2)" >> /etc/environment && \
    echo "export CUDA_VERSION_NODOT=$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.')" >> /etc/environment && \
    echo "export CUDA_VERSION_DASH=$(echo $CUDA_VERSION | cut -d. -f1,2 | tr '.' '-')" >> /etc/environment
```

Then, you could source this file in subsequent RUN commands and use the variables. For example, this line would become:

```dockerfile
RUN . /etc/environment && echo "/usr/local/cuda-${CUDA_VERSION_SHORT}/compat/" > /etc/ld.so.conf.d/zzz-cuda-compat.conf && ldconfig
```

Applying this pattern throughout the Dockerfile would significantly reduce redundancy and improve readability. A similar change would be needed in the vllm-base stage.
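For reference, the three suggested variables would expand as follows (the `CUDA_VERSION` value here is just an example; the real value comes from the NVIDIA base image):

```shell
CUDA_VERSION=12.8.1
echo "$CUDA_VERSION" | cut -d. -f1,2              # CUDA_VERSION_SHORT -> 12.8
echo "$CUDA_VERSION" | cut -d. -f1,2 | tr -d '.'  # CUDA_VERSION_NODOT -> 128
echo "$CUDA_VERSION" | cut -d. -f1,2 | tr '.' '-' # CUDA_VERSION_DASH  -> 12-8
```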
```diff
-# Ensure CUDA compatibility library is loaded
-RUN echo "/usr/local/cuda-$(echo "$CUDA_VERSION" | cut -d. -f1,2)/compat/" > /etc/ld.so.conf.d/cuda-compat.conf && ldconfig
+# Ensure CUDA compatibility library is loaded at last to avoid overriding the system libraries
+RUN echo "/usr/local/cuda-$(echo "$CUDA_VERSION" | cut -d. -f1,2)/compat/" > /etc/ld.so.conf.d/zzz-cuda-compat.conf && ldconfig
```
Similar to the comment on line 136, this is another instance of repeated logic for CUDA version string manipulation. Applying the suggested refactoring in this vllm-base stage as well would improve maintainability. You can add a RUN command to define and export CUDA_VERSION_SHORT, CUDA_VERSION_NODOT, and CUDA_VERSION_DASH after Python installation (e.g., after line 537) and reuse them in subsequent commands.
hmmm after a second thought it might break old drivers.
This PR is ineffective. It just reverts everything to the state before #30784, where CUDA forward compatibility is completely disabled. 1x NVIDIA A100 with NVIDIA 575.57.08 + CUDA 13.0:
The persistent cuda-compat.conf in /etc/ld.so.conf.d/ causes Error 803 on consumer NVIDIA GPUs (GeForce, RTX) when the host driver version is newer than the container's CUDA toolkit. CUDA forward compatibility is only supported on datacenter/professional GPUs. Replace the unconditional ldconfig registration with an opt-in mechanism:

- VLLM_ENABLE_CUDA_COMPATIBILITY=1 enables compat library loading
- VLLM_CUDA_COMPATIBILITY_PATH overrides the default compat path
- Runtime LD_LIBRARY_PATH is set before torch import in env_override.py
- Default is disabled (0) so consumer GPU users are unaffected

This fixes the regression introduced by the persistent cuda-compat.conf that broke systems with NVIDIA driver 580.x (CUDA 13.0 compatible).

Fixes: vllm-project#32373
Related: vllm-project#33992, vllm-project#34226

Signed-off-by: Andrew Mello <andrew@88plug.com>
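A minimal shell sketch of the opt-in behavior described in this commit message, using the variable names it proposes (VLLM_ENABLE_CUDA_COMPATIBILITY, VLLM_CUDA_COMPATIBILITY_PATH); the actual change lives in env_override.py before torch is imported, so this is only an illustrative equivalent, and the default compat path is an assumption:

```shell
# Default is disabled, so consumer GPU users are unaffected.
VLLM_ENABLE_CUDA_COMPATIBILITY="${VLLM_ENABLE_CUDA_COMPATIBILITY:-0}"
LD_LIBRARY_PATH=""
if [ "$VLLM_ENABLE_CUDA_COMPATIBILITY" = "1" ]; then
    # VLLM_CUDA_COMPATIBILITY_PATH overrides the default compat path.
    compat="${VLLM_CUDA_COMPATIBILITY_PATH:-/usr/local/cuda/compat}"
    LD_LIBRARY_PATH="${compat}${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
fi
echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-<empty>}"
```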
Superseded by #33992
Purpose
People still report `Error 803: system has unsupported display driver / cuda driver combination` after #33116. It is because some docker images specify the driver .so file in nvidia.conf, which gets suppressed by cuda-compat.conf. This PR adds a `zzz-` prefix to cuda-compat.conf so that it is ordered last.

Test Plan
Test some simple code:

```shell
python3 -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.device_count())"
```

Test Result
It passes.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.