[Improvement] Persist CUDA compat libraries paths to prevent reset on apt-get #30784
Conversation
Code Review
This pull request correctly persists the CUDA compatibility library path by creating a configuration file in /etc/ld.so.conf.d/, which is a good improvement for the Docker image's robustness. However, the implementation introduces a critical command injection vulnerability by using an unquoted build argument within a command substitution. I have provided specific comments and suggestions to address this. This pattern of unquoted variables appears elsewhere in the Dockerfile, and I strongly recommend a full audit to fix all instances and secure the build process.
Force-pushed from 47bf56c to 83b9527
Force-pushed from 83b9527 to 18ff788
wangshangsam left a comment
Looks reasonable, but could you include in the PR description a concrete example of where the existing code fails? Because:
If a user extends this image or runs a debug container and executes apt-get install (which triggers a default ldconfig), the custom compatibility path is wiped from the cache.
This I can understand. But
This causes the container to silently fall back to the host driver's native CUDA version (e.g., 12.4) instead of the container's optimized version (12.9), potentially degrading performance or raising version compatibility mismatch errors.
This I don't quite understand. I thought that the container's CUDA version is just the container's CUDA version, and the whole /usr/local/cuda-$(echo $CUDA_VERSION | cut -d. -f1,2)/compat/ thing is just there to enable compatibility mode (i.e., you can run a later CUDA version on an older driver version)?
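For reference, the path construction quoted in the comment above can be sketched as follows. The `CUDA_VERSION` value is an assumed example; in the real image it comes from the CUDA base image environment.

```shell
# Sketch of the path construction quoted above. CUDA_VERSION is an
# assumed example value for illustration only.
CUDA_VERSION=12.9.1
# Keep only major.minor (here: 12.9) to build the compat directory path.
COMPAT_DIR="/usr/local/cuda-$(echo "$CUDA_VERSION" | cut -d. -f1,2)/compat"
echo "$COMPAT_DIR"   # -> /usr/local/cuda-12.9/compat
```

The `compat/` directory ships a newer user-mode `libcuda.so`, so a container built against a newer CUDA toolkit can run on a host with an older kernel-mode driver; it does not change the host driver itself.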
mgoin left a comment
I see, this makes sense to me; the current configuration is fragile. But I agree with @wangshangsam that the PR description should clarify this.
Thanks @wangshangsam & @mgoin for the review! To give a concrete example, I ran two versions of the vLLM Docker image on a cluster node with CUDA 12.4 (driver 550.163.01) installed. The first image is the base image.

In the first Docker image (default), I have:

In the second (debug), I have:

Ultimately, there is only one driver version, but that driver may be compatible with different CUDA versions (including newer ones). Incompatibility issues arise when code compiled with CUDA 12.9 (or newer) is executed in a Docker container that lacks the necessary compatibility layer, causing it to fall back to the node's version (12.4). In my case, I encountered the following error when loading:

By running the following command, the issue got resolved:

The fix in this PR prevents the issue from appearing by making sure the compat layer is always enabled.
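The manual workaround described in the thread can be sketched like this. This is not necessarily the exact command the author ran; a temp directory stands in for `/etc/ld.so.conf.d` so the sketch runs without root, and the CUDA version is an assumed example.

```shell
# Sketch of the manual workaround (not necessarily the author's exact
# command): write the compat path into an ld.so.conf.d drop-in, then
# rebuild the loader cache. In a real container CONF_DIR would be
# /etc/ld.so.conf.d (root required); a temp dir is used here so the
# sketch is runnable anywhere.
CONF_DIR="$(mktemp -d)"
COMPAT_DIR="/usr/local/cuda-12.9/compat"   # assumed toolkit version
echo "$COMPAT_DIR" > "$CONF_DIR/00-cuda-compat.conf"
cat "$CONF_DIR/00-cuda-compat.conf"
# In the real container you would then run: ldconfig
```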
Signed-off-by: emricksini-h <emrick.birivoutin@hcompany.ai>
Force-pushed from 18ff788 to e64ddb2
Thanks @emricksini-h! Now this makes sense.
Unfortunately, I think this change doesn't work with newer drivers. PyTorch x vLLM benchmark jobs are using:

With this change, importing PyTorch fails right away:

Here is an example failure: https://github.com/pytorch/pytorch-integration-testing/actions/runs/21017403967/job/60426060877#step:19:1452
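A plausible reading of that failure: the compat override should only apply when the host driver is older than the container's toolkit, and on a newer driver, forcing the compat libcuda into the cache can break imports. A hypothetical guard (not part of the PR; both version values are assumed examples) might look like:

```shell
# Hypothetical guard (not in the PR): only prefer the compat libcuda
# when the host driver's CUDA capability is older than the toolkit's.
# Both version values below are assumed examples.
driver_cuda=13.0    # assumed: CUDA version supported by the host driver
toolkit_cuda=12.9   # assumed: CUDA toolkit version inside the image
# sort -V puts the lower version first; compat only helps when the
# driver version is strictly lower than the toolkit version.
if [ "$(printf '%s\n%s\n' "$driver_cuda" "$toolkit_cuda" | sort -V | head -n1)" = "$driver_cuda" ] \
   && [ "$driver_cuda" != "$toolkit_cuda" ]; then
  decision="enable compat (driver older than toolkit)"
else
  decision="skip compat (driver is new enough)"
fi
echo "$decision"
```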
… `apt-get` (vllm-project#30784) Signed-off-by: emricksini-h <emrick.birivoutin@hcompany.ai>
…reset on `apt-get` (vllm-project#30784)" This reverts commit 2a60ac9. Signed-off-by: emricksini-h <emrick.birivoutin@hcompany.ai>
… `apt-get` (vllm-project#30784) Signed-off-by: emricksini-h <emrick.birivoutin@hcompany.ai> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
…reset on `apt-get` (vllm-project#30784)" This reverts commit 2a60ac9.
…reset on `apt-get` (vllm-project#30784)" (#31) This reverts commit 2a60ac9.
* [Docker][Dev] Fix libnccl-dev version for the CUDA 13.0.1 devel image [Docker][Dev] Fix libnccl-dev version conflict for the CUDA 13.0.1 devel image Further update * feat: Support FA4 for mm-encoder-attn-backend for qwen models * feat: Kernel warmup for vit fa4 * fix: Fix some minor conflicts due to the introduction of flash_attn.cute * Revert "[Docker][Dev] Fix libnccl-dev version for the CUDA 13.0.1 devel image" This reverts commit ab76b28. * chore: Update requirements and revert README.md * chore: Install git for flash_attn cute installation * lint: Fix linting * Revert "[Improvement] Persist CUDA compat libraries paths to prevent reset on `apt-get` (vllm-project#30784)" (#31) This reverts commit 2a60ac9. --------- Co-authored-by: Shang Wang <shangw@nvidia.com>
Currently, the Dockerfile registers CUDA compatibility libraries using a transient `RUN ldconfig /path/to/compat` command. This updates the cache but does not persist the configuration.

If a user extends this image or runs a debug container and executes `apt-get install` (which triggers a default `ldconfig`), the custom compatibility path is wiped from the cache. This causes the container to silently fall back to the host driver's native CUDA version (e.g., 12.4) instead of the container's optimized version (12.9), potentially degrading performance or raising version compatibility mismatch errors.

This PR makes this more robust by writing the path to `/etc/ld.so.conf.d/00-cuda-compat.conf` before running `ldconfig`. This ensures the compatibility layer persists regardless of future package installations or cache rebuilds.
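Based on the description above, the Dockerfile change might look roughly like this sketch (not the exact diff from the PR). Note the quoting around the command substitution, which the automated review flagged as necessary to avoid injection via the build argument.

```dockerfile
# Sketch only, not the exact diff. Persist the compat path so any later
# ldconfig run (e.g. one triggered by apt-get install) keeps resolving it.
ARG CUDA_VERSION
RUN echo "/usr/local/cuda-$(echo "$CUDA_VERSION" | cut -d. -f1,2)/compat" \
      > /etc/ld.so.conf.d/00-cuda-compat.conf \
    && ldconfig
```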