[CI/Build][BugFix] fix cuda/compat loading order issue in docker build#33116

Merged
houseroad merged 2 commits into vllm-project:main from wpc:cuda-compat-fix
Jan 29, 2026

Conversation

@wpc
Contributor

@wpc wpc commented Jan 26, 2026

Purpose

In PR #30784 we persisted the CUDA compat lib path with ldconfig. This causes a CUDA init failure when the Docker image runs on a new driver that does not require the compatibility libs.

$ nvidia-smi
Mon Jan 26 13:09:50 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08             Driver Version: 580.105.08     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+

$ podman run --entrypoint=/bin/python3  --rm --device nvidia.com/gpu=all docker.io/vllm/vllm-openai:cu130-nightly-aarch64  -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:182: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:119.)
 return torch._C._cuda_getDeviceCount() > 0
False 2

The issue is the 00-* prefix on the compat path config: the compat path now has higher priority than even the normal CUDA libs.

$ podman run --entrypoint=/bin/bash  --rm --device nvidia.com/gpu=all docker.io/vllm/vllm-openai:cu130-nightly-aarch64 -c "ls /etc/ld.so.conf.d/"
00-cuda-compat.conf
00-nvcr-4117280097.conf
000_cuda.conf
987_cuda-13.conf
aarch64-linux-gnu.conf
libc.conf
nvidia.conf
zz-nvcr-4281892442.conf

This patch removes the 00- prefix, lowering the compat lib loading priority and fixing the problem.
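For background, glibc's ldconfig expands the `include /etc/ld.so.conf.d/*.conf` glob in lexical order, so the filename prefix decides which directories land first in the library search path. A minimal sketch of the ordering, using the filenames from the `ls` output above (nothing here touches real configs):

```shell
# ldconfig processes /etc/ld.so.conf.d/*.conf in lexical (glob-sorted) order.
# With the 00- prefix, the compat config sorts before the regular CUDA ones:
before=$(printf '%s\n' 00-cuda-compat.conf 000_cuda.conf 987_cuda-13.conf | LC_ALL=C sort | head -n1)
# Dropping the 00- prefix makes the compat config sort after the CUDA configs:
after=$(printf '%s\n' cuda-compat.conf 000_cuda.conf 987_cuda-13.conf | LC_ALL=C sort | head -n1)
echo "$before"   # 00-cuda-compat.conf  -> compat libs shadow the CUDA libs
echo "$after"    # 000_cuda.conf        -> regular CUDA libs come first
```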

Test Plan

Build a new image with the fix and run the following; there should be no 803 error and the result should be "True 2".

$ podman run --entrypoint=/bin/python3  --rm --device nvidia.com/gpu=all <new-image-tag>  -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

Test Result

GB200 Driver Version: 580.105.08 (does not need compat libs):

$ podman run --entrypoint=/bin/python3  --rm --device nvidia.com/gpu=all localhost/vllm/vllm-openai:cu130-aarch64-785cd232b  -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

True 2

H100 Driver Version: 535.183.06 (needs compat libs to run CUDA 13):

$ podman run --entrypoint=/bin/python3  --rm --device nvidia.com/gpu=all localhost/vllm-openai:cu130-x86_64-673c04198  -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

True 8

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Pengchao Wang <wpc@fb.com>
@wpc wpc force-pushed the cuda-compat-fix branch from 785cd23 to 673c041 on January 26, 2026 21:15
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request fixes a CUDA library loading order issue by renaming the ld.so.conf.d configuration file to lower its priority. The change is correct and well-justified. I've also identified a potential issue with how the CUDA_VERSION is parsed throughout the Dockerfile. The current method is fragile and can fail if the version number is an integer. I've provided suggestions to make the version parsing more robust. While I've commented on the lines changed in this PR, this issue is present in other parts of the file and should be addressed consistently.
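To illustrate the reviewer's parsing concern, here is a hedged sketch (the helper name is hypothetical, not from the Dockerfile): stripping everything from the first dot handles `13`, `13.0`, and `13.0.1` alike, whereas indexing into a dot-split fails when the version is a bare integer.

```shell
# Hypothetical helper: take the major version by stripping the longest
# suffix starting at the first "." (a no-op when there is no dot at all).
cuda_major() { echo "${1%%.*}"; }

for v in 13 13.0 13.0.1; do
  echo "$v -> $(cuda_major "$v")"   # every line ends in "-> 13"
done
```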

@wpc wpc changed the title from "fix cuda/compat loading order issue in docker build" to "[BugFix] fix cuda/compat loading order issue in docker build" Jan 26, 2026
@mergify mergify bot added the "bug: Something isn't working" label Jan 26, 2026
@yeqcharlotte yeqcharlotte requested review from khluu and mgoin January 26, 2026 21:38
@yeqcharlotte yeqcharlotte added the "ready: ONLY add when PR is ready to merge/full CI is needed" label Jan 26, 2026
@yeqcharlotte yeqcharlotte changed the title from "[BugFix] fix cuda/compat loading order issue in docker build" to "[CI/Build][BugFix] fix cuda/compat loading order issue in docker build" Jan 26, 2026
@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Jan 29, 2026
@houseroad houseroad merged commit 2515bbd into vllm-project:main Jan 29, 2026
95 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Jan 29, 2026
@wangshangsam
Collaborator

Thanks a lot for the fix, @wpc !

I've been very busy with MLPerf deadline recently and didn't have much time to get back to this problem. Since this PR is merged, I'll give the latest main a try on our internal dev GB200/GB300 clusters once the deadline is over, and report back if I encounter any further issues.

huydhn added a commit to pytorch/pytorch-integration-testing that referenced this pull request Jan 30, 2026
Coming from vllm-project/vllm#33116

Signed-off-by: Huy Do <huydhn@gmail.com>
@ehfd
Contributor

ehfd commented Jan 30, 2026

@khluu

#33369

Maybe this PR should be backported to a patch release?

huydhn added a commit to pytorch/pytorch-integration-testing that referenced this pull request Jan 30, 2026
Coming from vllm-project/vllm#33116

Signed-off-by: Huy Do <huydhn@gmail.com>
@ehfd
Contributor

ehfd commented Jan 30, 2026

#32373 (comment)

This PR did not really seem to fix things correctly...

@iori2333
Contributor

iori2333 commented Jan 31, 2026

Unfortunately, this fix does not work for gaming cards with no cuda-compat support. Maybe reverting #30784 is necessary for these cards...

@tahvane1

I still get this error with CUDA 13.1 using the nightly or cu130-nightly Docker builds, and I also have the compatibility packages installed. We have A5000 cards.

apd10 pushed a commit to apd10/vllm that referenced this pull request Jan 31, 2026
vllm-project#33116)

Signed-off-by: Pengchao Wang <wpc@fb.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
PiratePai pushed a commit to PiratePai/epd_shm that referenced this pull request Feb 3, 2026
vllm-project#33116)

Signed-off-by: Pengchao Wang <wpc@fb.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
Signed-off-by: PiratePai <416932041@qq.com>
Signed-off-by: Pai <416932041@qq.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
vllm-project#33116)

Signed-off-by: Pengchao Wang <wpc@fb.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>

Labels

bug Something isn't working ci/build nvidia ready ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done


7 participants