[CI/Build][BugFix] fix cuda/compat loading order issue in docker build#33116

Merged
houseroad merged 2 commits into vllm-project:main from wpc:cuda-compat-fix
Jan 29, 2026

Conversation

@wpc
Contributor

@wpc wpc commented Jan 26, 2026

Purpose

In PR #30784 we persisted the CUDA compat lib path with ldconfig. This causes a CUDA init failure when the Docker image runs on a new driver that does not require the compatibility libs.

$ nvidia-smi
Mon Jan 26 13:09:50 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08             Driver Version: 580.105.08     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+

$ podman run --entrypoint=/bin/python3  --rm --device nvidia.com/gpu=all docker.io/vllm/vllm-openai:cu130-nightly-aarch64  -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:182: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:119.)
 return torch._C._cuda_getDeviceCount() > 0
False 2

The issue is the 00-* prefix on the compat path config: the compat path now has higher priority than even the normal CUDA libs.

$ podman run --entrypoint=/bin/bash  --rm --device nvidia.com/gpu=all docker.io/vllm/vllm-openai:cu130-nightly-aarch64 -c "ls /etc/ld.so.conf.d/"
00-cuda-compat.conf
00-nvcr-4117280097.conf
000_cuda.conf
987_cuda-13.conf
aarch64-linux-gnu.conf
libc.conf
nvidia.conf
zz-nvcr-4281892442.conf

This patch removes the 00- prefix, lowering the compat lib loading priority and fixing the problem.
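For background, glibc's ldconfig expands the `include /etc/ld.so.conf.d/*.conf` glob in lexical order, so the filename prefix decides which directories land first in the library search path. A minimal sketch of the ordering, using the filenames from the `ls` output above (nothing here touches real configs):

```shell
# ldconfig processes /etc/ld.so.conf.d/*.conf in lexical (glob-sorted) order.
# With the 00- prefix, the compat config sorts before the regular CUDA ones:
before=$(printf '%s\n' 00-cuda-compat.conf 000_cuda.conf 987_cuda-13.conf | LC_ALL=C sort | head -n1)
# Dropping the 00- prefix makes the compat config sort after the CUDA configs:
after=$(printf '%s\n' cuda-compat.conf 000_cuda.conf 987_cuda-13.conf | LC_ALL=C sort | head -n1)
echo "$before"   # 00-cuda-compat.conf  -> compat libs shadow the CUDA libs
echo "$after"    # 000_cuda.conf        -> regular CUDA libs come first
```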

Test Plan

Build a new image with the fix and run the following; there should be no 803 error and the result should be "True 2".

$ podman run --entrypoint=/bin/python3  --rm --device nvidia.com/gpu=all <new-image-tag>  -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

Test Result

GB200 Driver Version: 580.105.08 (does not need compat libs):

$ podman run --entrypoint=/bin/python3  --rm --device nvidia.com/gpu=all localhost/vllm/vllm-openai:cu130-aarch64-785cd232b  -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

True 2

H100 Driver Version: 535.183.06 (needs compat libs to run CUDA 13):

$ podman run --entrypoint=/bin/python3  --rm --device nvidia.com/gpu=all localhost/vllm-openai:cu130-x86_64-673c04198  -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

True 8

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Pengchao Wang <wpc@fb.com>
@wpc wpc force-pushed the cuda-compat-fix branch from 785cd23 to 673c041 on January 26, 2026 21:15
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request fixes a CUDA library loading order issue by renaming the ld.so.conf.d configuration file to lower its priority. The change is correct and well-justified. I've also identified a potential issue with how the CUDA_VERSION is parsed throughout the Dockerfile. The current method is fragile and can fail if the version number is an integer. I've provided suggestions to make the version parsing more robust. While I've commented on the lines changed in this PR, this issue is present in other parts of the file and should be addressed consistently.
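To illustrate the reviewer's parsing concern, here is a hedged sketch (the helper name is hypothetical, not from the Dockerfile): stripping everything from the first dot handles `13`, `13.0`, and `13.0.1` alike, whereas indexing into a dot-split fails when the version is a bare integer.

```shell
# Hypothetical helper: take the major version by stripping the longest
# suffix starting at the first "." (a no-op when there is no dot at all).
cuda_major() { echo "${1%%.*}"; }

for v in 13 13.0 13.0.1; do
  echo "$v -> $(cuda_major "$v")"   # every line ends in "-> 13"
done
```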

@wpc wpc changed the title from "fix cuda/compat loading order issue in docker build" to "[BugFix] fix cuda/compat loading order issue in docker build" Jan 26, 2026
@mergify mergify bot added the "bug: Something isn't working" label Jan 26, 2026
@yeqcharlotte yeqcharlotte requested review from khluu and mgoin January 26, 2026 21:38
@yeqcharlotte yeqcharlotte added the "ready: ONLY add when PR is ready to merge/full CI is needed" label Jan 26, 2026
@yeqcharlotte yeqcharlotte changed the title from "[BugFix] fix cuda/compat loading order issue in docker build" to "[CI/Build][BugFix] fix cuda/compat loading order issue in docker build" Jan 26, 2026
@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Jan 29, 2026
@houseroad houseroad merged commit 2515bbd into vllm-project:main Jan 29, 2026
95 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Jan 29, 2026
@wangshangsam
Collaborator

Thanks a lot for the fix, @wpc !

I've been very busy with MLPerf deadline recently and didn't have much time to get back to this problem. Since this PR is merged, I'll give the latest main a try on our internal dev GB200/GB300 clusters once the deadline is over, and report back if I encounter any further issues.

huydhn added a commit to pytorch/pytorch-integration-testing that referenced this pull request Jan 30, 2026
Coming from vllm-project/vllm#33116

Signed-off-by: Huy Do <huydhn@gmail.com>
@ehfd
Contributor

ehfd commented Jan 30, 2026

@khluu

#33369

Maybe this PR should be backported to a patch release?

huydhn added a commit to pytorch/pytorch-integration-testing that referenced this pull request Jan 30, 2026
Coming from vllm-project/vllm#33116

Signed-off-by: Huy Do <huydhn@gmail.com>
@ehfd
Contributor

ehfd commented Jan 30, 2026

#32373 (comment)

This PR did not really seem to fix things correctly...

@iori2333
Contributor

iori2333 commented Jan 31, 2026

Unfortunately, this fix does not work for gaming cards with no cuda-compat support. Maybe reverting #30784 is necessary for these cards...

@tahvane1

I still get this error with CUDA 13.1 using the nightly or cu130-nightly Docker builds, and I also have the compatibility packages installed. We have A5000 cards.

apd10 pushed a commit to apd10/vllm that referenced this pull request Jan 31, 2026
vllm-project#33116)

Signed-off-by: Pengchao Wang <wpc@fb.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
PiratePai pushed a commit to PiratePai/epd_shm that referenced this pull request Feb 3, 2026
vllm-project#33116)

Signed-off-by: Pengchao Wang <wpc@fb.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
Signed-off-by: PiratePai <416932041@qq.com>
Signed-off-by: Pai <416932041@qq.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
vllm-project#33116)

Signed-off-by: Pengchao Wang <wpc@fb.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>

Labels

bug Something isn't working ci/build nvidia ready ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done


7 participants