[Bugfix] Make CUDA compat library loading opt-in to fix consumer GPUs#34821
88plug wants to merge 2 commits into vllm-project:main
Conversation
Code Review
This pull request addresses a critical bug where CUDA forward compatibility libraries were unconditionally loaded in Docker containers, causing crashes on consumer GPUs. The fix makes this feature opt-in via the VLLM_ENABLE_CUDA_COMPATIBILITY environment variable, which is a sound approach. The implementation in vllm/env_override.py is robust, setting the LD_LIBRARY_PATH before torch is imported to ensure the dynamic linker picks up the correct libraries. The logic for detecting the compatibility library path includes several fallbacks, making it flexible for different environments. The changes in the Dockerfile correctly switch from a hardcoded configuration to using the new environment variable. Overall, this is a well-executed fix for a significant issue.
Hi @88plug, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
The persistent `cuda-compat.conf` in `/etc/ld.so.conf.d/` causes Error 803 on consumer NVIDIA GPUs (GeForce, RTX) when the host driver version is newer than the container's CUDA toolkit. CUDA forward compatibility is only supported on datacenter/professional GPUs.

Replace the unconditional ldconfig registration with an opt-in mechanism:

- `VLLM_ENABLE_CUDA_COMPATIBILITY=1` enables compat library loading
- `VLLM_CUDA_COMPATIBILITY_PATH` overrides the default compat path
- Runtime `LD_LIBRARY_PATH` is set before torch import in `env_override.py`
- Default is disabled (0) so consumer GPU users are unaffected

This fixes the regression introduced by the persistent `cuda-compat.conf` that broke systems with NVIDIA driver 580.x (CUDA 13.0 compatible).

Fixes: vllm-project#32373
Related: vllm-project#33992, vllm-project#34226

Signed-off-by: Andrew Mello <andrew@88plug.com>
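The opt-in mechanism described in the commit message can be sketched roughly as follows. This is an illustrative sketch, not vLLM's actual implementation: the function name, the default compat path, and the accepted flag values are assumptions, and the real logic lives in `vllm/env_override.py`.

```python
import os

# Hypothetical sketch of the opt-in gating described above. The function
# name, default path, and accepted flag values are illustrative assumptions.
def maybe_prepend_compat_path(environ: dict) -> dict:
    """Prepend the CUDA compat dir to LD_LIBRARY_PATH only when opted in."""
    # Opt-in flag: anything other than "1"/"true" leaves the env untouched,
    # so consumer GPU users are unaffected by default.
    flag = environ.get("VLLM_ENABLE_CUDA_COMPATIBILITY", "0").strip().lower()
    if flag not in ("1", "true"):
        return environ

    # A custom path wins over the conventional default location (assumed here).
    compat = environ.get("VLLM_CUDA_COMPATIBILITY_PATH",
                         "/usr/local/cuda/compat")

    parts = [p for p in environ.get("LD_LIBRARY_PATH", "").split(os.pathsep) if p]
    if parts and parts[0] == compat:
        return environ  # already at the front: no-op
    # Deduplicate, then prepend so the compat libs take precedence.
    parts = [compat] + [p for p in parts if p != compat]
    environ["LD_LIBRARY_PATH"] = os.pathsep.join(parts)
    return environ
```

Because the dynamic linker reads `LD_LIBRARY_PATH` at process start, a helper like this only matters if it runs before `import torch`, which is why the PR places it in `env_override.py`.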
Force-pushed ba7241c to 017f664
Unit tests for `_maybe_set_cuda_compatibility_path()` and `_get_torch_cuda_version()` covering:

- Env var parsing (0/1/true/false/whitespace)
- Path detection priority (custom > conda > default)
- `LD_LIBRARY_PATH` prepend, dedup, and no-op on already-front
- Graceful handling when no valid path exists

Signed-off-by: Andrew Mello <andrew@88plug.com>
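The env-var parsing cases listed in the test commit could look roughly like this. The helper below is a hedged stand-in, not vLLM's actual code; `unittest.mock.patch.dict` keeps each case isolated by restoring `os.environ` afterwards.

```python
import os
from unittest import mock

# Hedged stand-in for the flag-parsing helper under test; the real
# implementation in vllm/env_override.py may differ in detail.
def compat_enabled() -> bool:
    raw = os.environ.get("VLLM_ENABLE_CUDA_COMPATIBILITY", "0")
    return raw.strip().lower() in ("1", "true")

# patch.dict restores os.environ after the with-block, isolating each case.
def check(raw: str, expected: bool) -> None:
    with mock.patch.dict(os.environ, {"VLLM_ENABLE_CUDA_COMPATIBILITY": raw}):
        assert compat_enabled() is expected

for raw, expected in [("0", False), ("1", True), ("true", True),
                      ("TRUE", True), ("false", False), (" 1 ", True), ("", False)]:
    check(raw, expected)
```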
Thanks for the review @gemini-code-assist! Pre-commit checks are now all passing — the initial failure was from the first push before formatting was applied. Added 21 unit tests for the compat path logic in the follow-up commit (6ffca6f). |
That's great to hear, @88plug! Thanks for the update and for adding the comprehensive unit tests for the CUDA compatibility path logic. Ensuring robust test coverage for such critical environment-dependent logic is excellent for maintainability and reliability. |
@88plug Why not just PR against my branch and combine? I can accept it. And please add |
@ehfd Sounds good — happy to contribute my tests to your branch. Your PR has the envs.py registration and docs that mine is missing, so combining makes sense. I've opened https://github.com/ehfd/vllm/pull/1 against your branch.
All 21 tests pass against your branch. I'll close this PR once yours has the tests integrated. Looking forward to getting this merged for v0.16.0 RC.
@88plug I have integrated everything and accepted the PR. Also, properly attributed the co-authors (you) as well. |
Closing — tests contributed to @ehfd's PR #34226 which covers this fix plus envs.py registration and docs. See https://github.com/ehfd/vllm/pull/1 |
Purpose
Fix CUDA forward compatibility library loading that causes Error 803 (`CUDA_ERROR_SYSTEM_DRIVER_MISMATCH`) in Docker containers. The persistent `cuda-compat.conf` in `/etc/ld.so.conf.d/` unconditionally loads the container's CUDA compat libs, which shadow the host-mounted driver when the host has a newer CUDA version than the container's toolkit (e.g., host driver 580.x with CUDA 13.0 support, container built with CUDA 12.x).

This was originally reported on an NVIDIA B200 (datacenter) with driver 580.105.08, but the issue affects any GPU, datacenter or consumer, when the host driver version exceeds the container's CUDA toolkit version. CUDA forward compatibility is only supported on datacenter GPUs and select NGC-ready RTX SKUs (docs), so unconditionally enabling it is incorrect.
This PR makes compat library loading opt-in via `VLLM_ENABLE_CUDA_COMPATIBILITY=1`.

Fixes #32373
Related: #33992, #34226
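Knowing the container's CUDA toolkit version without initializing CUDA is part of what makes the check safe to run at import time. A hedged sketch of one way to do that, reading torch's bundled version off disk rather than importing torch (the helper name echoes the PR; vLLM's actual logic may differ):

```python
import importlib.util
import re
from pathlib import Path
from typing import Optional

# Illustrative sketch only: read torch's bundled CUDA version from
# torch/version.py on disk, so torch (and thus CUDA) is never imported.
def get_torch_cuda_version() -> Optional[str]:
    spec = importlib.util.find_spec("torch")
    if spec is None or spec.origin is None:
        return None  # torch is not installed: nothing to do
    version_file = Path(spec.origin).parent / "version.py"
    if not version_file.is_file():
        return None
    # torch/version.py typically contains a line like: cuda = '12.1'
    # (CPU-only builds have cuda = None, which this regex won't match).
    match = re.search(r"^cuda\s*=\s*['\"]([\d.]+)['\"]",
                      version_file.read_text(), re.MULTILINE)
    return match.group(1) if match else None
```

Returning `None` in every failure path lets the caller fall back gracefully when no valid compat path can be derived, matching the behavior the tests cover.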
Changes
- `docker/Dockerfile`: Replace persistent `cuda-compat.conf` ldconfig entries with `ENV VLLM_ENABLE_CUDA_COMPATIBILITY=0` in both build stages
- `vllm/env_override.py`: Add `_maybe_set_cuda_compatibility_path()` that:
  - runs only when `VLLM_ENABLE_CUDA_COMPATIBILITY=1`
  - detects the CUDA version via `importlib.util` without importing torch (avoids premature CUDA init)
  - defaults to `/usr/local/cuda-{version}/compat`, overridable via `VLLM_CUDA_COMPATIBILITY_PATH`
  - prepends the compat path to `LD_LIBRARY_PATH` before `import torch`

Test Plan
21 unit tests covering:
- Env var parsing (`0`/`1`/`true`/`false`/whitespace variants)
- Path detection priority (custom > conda > default `/usr/local/cuda-{ver}/compat`)
- `LD_LIBRARY_PATH` prepend, deduplication, and no-op when already at front
- `_get_torch_cuda_version()` with and without torch available

Test Result
All pre-commit hooks pass (ruff-check, ruff-format, mypy, typos, SPDX headers).
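The "custom > conda > default" priority exercised by the test plan can be sketched as below. This is a hedged illustration: the function name and the conda sub-path are assumptions, not vLLM's actual layout.

```python
import os

# Hedged sketch of the path-priority logic the tests exercise:
# explicit override > conda prefix > documented default.
def pick_compat_path(environ: dict, cuda_version: str) -> str:
    custom = environ.get("VLLM_CUDA_COMPATIBILITY_PATH")
    if custom:
        return custom                                # explicit override wins
    conda = environ.get("CONDA_PREFIX")
    if conda:
        return os.path.join(conda, "cuda-compat")    # assumed conda location
    return f"/usr/local/cuda-{cuda_version}/compat"  # default from the PR
```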