[Bugfix] Make CUDA compat library loading opt-in to fix consumer GPUs#34821
88plug wants to merge 2 commits into vllm-project:main
Conversation
Code Review
This pull request addresses a critical bug where CUDA forward compatibility libraries were unconditionally loaded in Docker containers, causing crashes on consumer GPUs. The fix makes this feature opt-in via the VLLM_ENABLE_CUDA_COMPATIBILITY environment variable, which is a sound approach. The implementation in vllm/env_override.py is robust, setting the LD_LIBRARY_PATH before torch is imported to ensure the dynamic linker picks up the correct libraries. The logic for detecting the compatibility library path includes several fallbacks, making it flexible for different environments. The changes in the Dockerfile correctly switch from a hardcoded configuration to using the new environment variable. Overall, this is a well-executed fix for a significant issue.
Hi @88plug, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
The persistent `cuda-compat.conf` in `/etc/ld.so.conf.d/` causes Error 803 on consumer NVIDIA GPUs (GeForce, RTX) when the host driver version is newer than the container's CUDA toolkit. CUDA forward compatibility is only supported on datacenter/professional GPUs.

Replace the unconditional ldconfig registration with an opt-in mechanism:

- `VLLM_ENABLE_CUDA_COMPATIBILITY=1` enables compat library loading
- `VLLM_CUDA_COMPATIBILITY_PATH` overrides the default compat path
- Runtime `LD_LIBRARY_PATH` is set before torch import in `env_override.py`
- Default is disabled (0) so consumer GPU users are unaffected

This fixes the regression introduced by the persistent `cuda-compat.conf` that broke systems with NVIDIA driver 580.x (CUDA 13.0 compatible).

Fixes: vllm-project#32373
Related: vllm-project#33992, vllm-project#34226

Signed-off-by: Andrew Mello <andrew@88plug.com>
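The opt-in mechanism described in the commit message can be sketched roughly as follows. This is an illustrative sketch, not vLLM's actual implementation: the function name, the default compat path, and the accepted flag values are assumptions, and the real logic lives in `vllm/env_override.py`.

```python
import os

# Hypothetical sketch of the opt-in gating described above. The function
# name, default path, and accepted flag values are illustrative assumptions.
def maybe_prepend_compat_path(environ: dict) -> dict:
    """Prepend the CUDA compat dir to LD_LIBRARY_PATH only when opted in."""
    # Opt-in flag: anything other than "1"/"true" leaves the env untouched,
    # so consumer GPU users are unaffected by default.
    flag = environ.get("VLLM_ENABLE_CUDA_COMPATIBILITY", "0").strip().lower()
    if flag not in ("1", "true"):
        return environ

    # A custom path wins over the conventional default location (assumed here).
    compat = environ.get("VLLM_CUDA_COMPATIBILITY_PATH",
                         "/usr/local/cuda/compat")

    parts = [p for p in environ.get("LD_LIBRARY_PATH", "").split(os.pathsep) if p]
    if parts and parts[0] == compat:
        return environ  # already at the front: no-op
    # Deduplicate, then prepend so the compat libs take precedence.
    parts = [compat] + [p for p in parts if p != compat]
    environ["LD_LIBRARY_PATH"] = os.pathsep.join(parts)
    return environ
```

Because the dynamic linker reads `LD_LIBRARY_PATH` at process start, a helper like this only matters if it runs before `import torch`, which is why the PR places it in `env_override.py`.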
Force-pushed ba7241c to 017f664
Unit tests for `_maybe_set_cuda_compatibility_path()` and `_get_torch_cuda_version()` covering:

- Env var parsing (0/1/true/false/whitespace)
- Path detection priority (custom > conda > default)
- `LD_LIBRARY_PATH` prepend, dedup, and no-op on already-front
- Graceful handling when no valid path exists

Signed-off-by: Andrew Mello <andrew@88plug.com>
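The env-var parsing cases listed in the test commit could look roughly like this. The helper below is a hedged stand-in, not vLLM's actual code; `unittest.mock.patch.dict` keeps each case isolated by restoring `os.environ` afterwards.

```python
import os
from unittest import mock

# Hedged stand-in for the flag-parsing helper under test; the real
# implementation in vllm/env_override.py may differ in detail.
def compat_enabled() -> bool:
    raw = os.environ.get("VLLM_ENABLE_CUDA_COMPATIBILITY", "0")
    return raw.strip().lower() in ("1", "true")

# patch.dict restores os.environ after the with-block, isolating each case.
def check(raw: str, expected: bool) -> None:
    with mock.patch.dict(os.environ, {"VLLM_ENABLE_CUDA_COMPATIBILITY": raw}):
        assert compat_enabled() is expected

for raw, expected in [("0", False), ("1", True), ("true", True),
                      ("TRUE", True), ("false", False), (" 1 ", True), ("", False)]:
    check(raw, expected)
```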
Thanks for the review @gemini-code-assist! Pre-commit checks are now all passing — the initial failure was from the first push before formatting was applied. Added 21 unit tests for the compat path logic in the follow-up commit (6ffca6f). |
That's great to hear, @88plug! Thanks for the update and for adding the comprehensive unit tests for the CUDA compatibility path logic. Ensuring robust test coverage for such critical environment-dependent logic is excellent for maintainability and reliability. |
@88plug Why not just PR against my branch and combine? I can accept it. And please add |
@ehfd Sounds good — happy to contribute my tests to your branch. Your PR has the envs.py registration and docs that mine is missing, so combining makes sense. I've opened https://github.com/ehfd/vllm/pull/1 against your branch.
All 21 tests pass against your branch. I'll close this PR once yours has the tests integrated. Looking forward to getting this merged for v0.16.0 RC.
@88plug I have integrated everything and accepted the PR. Also, properly attributed the co-authors (you) as well. |
Closing — tests contributed to @ehfd's PR #34226 which covers this fix plus envs.py registration and docs. See https://github.com/ehfd/vllm/pull/1 |
Purpose
Fix CUDA forward compatibility library loading that causes Error 803 (`CUDA_ERROR_SYSTEM_DRIVER_MISMATCH`) in Docker containers. The persistent `cuda-compat.conf` in `/etc/ld.so.conf.d/` unconditionally loads the container's CUDA compat libs, which shadow the host-mounted driver when the host has a newer CUDA version than the container's toolkit (e.g., host driver 580.x with CUDA 13.0 support, container built with CUDA 12.x).

This was originally reported on an NVIDIA B200 (datacenter) with driver 580.105.08, but the issue affects any GPU, datacenter or consumer, when the host driver version exceeds the container's CUDA toolkit version. CUDA forward compatibility is only supported on datacenter GPUs and select NGC-ready RTX SKUs (docs), so unconditionally enabling it is incorrect.
This PR makes compat library loading opt-in via `VLLM_ENABLE_CUDA_COMPATIBILITY=1`.

Fixes #32373
Related: #33992, #34226
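Knowing the container's CUDA toolkit version without initializing CUDA is part of what makes the check safe to run at import time. A hedged sketch of one way to do that, reading torch's bundled version off disk rather than importing torch (the helper name echoes the PR; vLLM's actual logic may differ):

```python
import importlib.util
import re
from pathlib import Path
from typing import Optional

# Illustrative sketch only: read torch's bundled CUDA version from
# torch/version.py on disk, so torch (and thus CUDA) is never imported.
def get_torch_cuda_version() -> Optional[str]:
    spec = importlib.util.find_spec("torch")
    if spec is None or spec.origin is None:
        return None  # torch is not installed: nothing to do
    version_file = Path(spec.origin).parent / "version.py"
    if not version_file.is_file():
        return None
    # torch/version.py typically contains a line like: cuda = '12.1'
    # (CPU-only builds have cuda = None, which this regex won't match).
    match = re.search(r"^cuda\s*=\s*['\"]([\d.]+)['\"]",
                      version_file.read_text(), re.MULTILINE)
    return match.group(1) if match else None
```

Returning `None` in every failure path lets the caller fall back gracefully when no valid compat path can be derived, matching the behavior the tests cover.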
Changes
- `docker/Dockerfile`: Replace persistent `cuda-compat.conf` ldconfig entries with `ENV VLLM_ENABLE_CUDA_COMPATIBILITY=0` in both build stages
- `vllm/env_override.py`: Add `_maybe_set_cuda_compatibility_path()` that:
  - runs only when `VLLM_ENABLE_CUDA_COMPATIBILITY=1`
  - detects the CUDA version via `importlib.util` without importing torch (avoids premature CUDA init)
  - defaults to `/usr/local/cuda-{version}/compat`, overridable via `VLLM_CUDA_COMPATIBILITY_PATH`
  - prepends the compat path to `LD_LIBRARY_PATH` before `import torch`

Test Plan
21 unit tests covering:
- Env var parsing (`0`/`1`/`true`/`false`/whitespace variants)
- Path detection priority (custom > conda > default `/usr/local/cuda-{ver}/compat`)
- `LD_LIBRARY_PATH` prepend, deduplication, and no-op when already at front
- `_get_torch_cuda_version()` with and without torch available

Test Result
All pre-commit hooks pass (ruff-check, ruff-format, mypy, typos, SPDX headers).
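The "custom > conda > default" priority exercised by the test plan can be sketched as below. This is a hedged illustration: the function name and the conda sub-path are assumptions, not vLLM's actual layout.

```python
import os

# Hedged sketch of the path-priority logic the tests exercise:
# explicit override > conda prefix > documented default.
def pick_compat_path(environ: dict, cuda_version: str) -> str:
    custom = environ.get("VLLM_CUDA_COMPATIBILITY_PATH")
    if custom:
        return custom                                # explicit override wins
    conda = environ.get("CONDA_PREFIX")
    if conda:
        return os.path.join(conda, "cuda-compat")    # assumed conda location
    return f"/usr/local/cuda-{cuda_version}/compat"  # default from the PR
```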