[Bugfix] Make CUDA compat library loading opt-in to fix consumer GPUs#34821

Closed
88plug wants to merge 2 commits into vllm-project:main from 88plug:claude/fix-cuda-compat-consumer-gpus

Conversation

@88plug
Contributor

@88plug 88plug commented Feb 18, 2026

Purpose

Fix CUDA forward compatibility library loading that causes Error 803 (CUDA_ERROR_SYSTEM_DRIVER_MISMATCH) in Docker containers. The persistent cuda-compat.conf in /etc/ld.so.conf.d/ unconditionally loads the container's CUDA compat libs, which shadow the host-mounted driver when the host has a newer CUDA version than the container's toolkit (e.g., host driver 580.x with CUDA 13.0 support, container built with CUDA 12.x).

This was originally reported on an NVIDIA B200 (datacenter) with driver 580.105.08, but the issue affects any GPU — datacenter or consumer — when the host driver version exceeds the container's CUDA toolkit version. CUDA forward compatibility is only supported on datacenter GPUs and select NGC-ready RTX SKUs (docs), so unconditionally enabling it is incorrect.

This PR makes compat library loading opt-in via VLLM_ENABLE_CUDA_COMPATIBILITY=1.

Fixes #32373
Related: #33992, #34226

Changes

  1. docker/Dockerfile: Replace persistent cuda-compat.conf ldconfig entries with ENV VLLM_ENABLE_CUDA_COMPATIBILITY=0 in both build stages
  2. vllm/env_override.py: Add _maybe_set_cuda_compatibility_path() that:
    • Only activates when VLLM_ENABLE_CUDA_COMPATIBILITY=1
    • Reads torch CUDA version via importlib.util without importing torch (avoids premature CUDA init)
    • Auto-detects compat path from conda or /usr/local/cuda-{version}/compat
    • Supports manual override via VLLM_CUDA_COMPATIBILITY_PATH
    • Sets LD_LIBRARY_PATH before torch is imported

Test Plan

pytest tests/cuda/test_cuda_compatibility_path.py -v

21 unit tests covering:

  • Env var parsing (0/1/true/false/whitespace variants)
  • Path detection priority (custom override > conda > default /usr/local/cuda-{ver}/compat)
  • LD_LIBRARY_PATH prepend, deduplication, and no-op when already at front
  • Graceful handling when no valid compat path exists
  • _get_torch_cuda_version() with and without torch available

Test Result

21 passed in 1.25s

All pre-commit hooks pass (ruff-check, ruff-format, mypy, typos, SPDX headers).


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update.

@mergify mergify bot added the ci/build, nvidia, and bug (Something isn't working) labels Feb 18, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a critical bug where CUDA forward compatibility libraries were unconditionally loaded in Docker containers, causing crashes on consumer GPUs. The fix makes this feature opt-in via the VLLM_ENABLE_CUDA_COMPATIBILITY environment variable, which is a sound approach. The implementation in vllm/env_override.py is robust, setting the LD_LIBRARY_PATH before torch is imported to ensure the dynamic linker picks up the correct libraries. The logic for detecting the compatibility library path includes several fallbacks, making it flexible for different environments. The changes in the Dockerfile correctly switch from a hardcoded configuration to using the new environment variable. Overall, this is a well-executed fix for a significant issue.

@mergify

mergify bot commented Feb 18, 2026

Hi @88plug, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

The persistent cuda-compat.conf in /etc/ld.so.conf.d/ causes Error 803
on consumer NVIDIA GPUs (GeForce, RTX) when the host driver version is
newer than the container's CUDA toolkit. CUDA forward compatibility is
only supported on datacenter/professional GPUs.

Replace the unconditional ldconfig registration with an opt-in mechanism:
- VLLM_ENABLE_CUDA_COMPATIBILITY=1 enables compat library loading
- VLLM_CUDA_COMPATIBILITY_PATH overrides the default compat path
- Runtime LD_LIBRARY_PATH is set before torch import in env_override.py
- Default is disabled (0) so consumer GPU users are unaffected

This fixes the regression introduced by the persistent cuda-compat.conf
that broke systems with NVIDIA driver 580.x (CUDA 13.0 compatible).

Fixes: vllm-project#32373
Related: vllm-project#33992, vllm-project#34226
Signed-off-by: Andrew Mello <andrew@88plug.com>
@88plug 88plug force-pushed the claude/fix-cuda-compat-consumer-gpus branch from ba7241c to 017f664 on February 18, 2026 16:28
Unit tests for _maybe_set_cuda_compatibility_path() and
_get_torch_cuda_version() covering:
- Env var parsing (0/1/true/false/whitespace)
- Path detection priority (custom > conda > default)
- LD_LIBRARY_PATH prepend, dedup, and no-op on already-front
- Graceful handling when no valid path exists

Signed-off-by: Andrew Mello <andrew@88plug.com>
@88plug
Contributor Author

88plug commented Feb 18, 2026

Thanks for the review @gemini-code-assist! Pre-commit checks are now all passing — the initial failure was from the first push before formatting was applied. Added 21 unit tests for the compat path logic in the follow-up commit (6ffca6f).

@gemini-code-assist
Contributor

That's great to hear, @88plug! Thanks for the update and for adding the comprehensive unit tests for the CUDA compatibility path logic. Ensuring robust test coverage for such critical environment-dependent logic is excellent for maintainability and reliability.

@ehfd
Contributor

ehfd commented Feb 19, 2026

@88plug Why not just PR against my branch and combine? I can accept it.

And please add Signed-off-by: Seungmin Kim <8457324+ehfd@users.noreply.github.com> in the commits either way.

@88plug
Contributor Author

88plug commented Feb 20, 2026

@ehfd Sounds good — happy to contribute my tests to your branch. Your PR has the envs.py registration and docs that mine is missing, so combining makes sense.

I've opened https://github.com/ehfd/vllm/pull/1 against your PR branch with:

  • tests/cuda/test_cuda_compatibility_path.py (21 unit tests covering env var parsing, path detection priority, LD_LIBRARY_PATH manipulation, deduplication, and edge cases)
  • Both Signed-off-by lines included

All 21 tests pass against your env_override.py implementation with zero changes needed — the implementations are functionally identical.

I'll close this PR once yours has the tests integrated. Looking forward to getting this merged for v0.16.0 RC.

@ehfd
Contributor

ehfd commented Feb 20, 2026

@88plug I have integrated everything and accepted the PR, and properly attributed you as a co-author.

@88plug
Contributor Author

88plug commented Feb 20, 2026

Closing — tests contributed to @ehfd's PR #34226 which covers this fix plus envs.py registration and docs. See https://github.com/ehfd/vllm/pull/1

@88plug 88plug closed this Feb 20, 2026
@github-project-automation github-project-automation bot moved this to Done in NVIDIA Feb 20, 2026

Labels

bug (Something isn't working), ci/build, nvidia

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Bug]: Fail to load vLLM on new NVIDIA driver

2 participants