fix(gguf): Auto-select compatible dtype for GGUF models on Blackwell#30410
fix(gguf): Auto-select compatible dtype for GGUF models on Blackwell#30410kitaekatt wants to merge 2 commits into
Conversation
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
There was a problem hiding this comment.
Code Review
This pull request introduces a mechanism to automatically resolve data type conflicts for GGUF models, particularly on newer hardware like NVIDIA's Blackwell GPUs. The changes correctly identify scenarios where bfloat16 is disallowed by the hardware/quantization format and float16 is disallowed by the model, and fall back to float32. My review found a critical issue with the compute capability check for Blackwell GPUs, which would prevent this feature from working as intended. I've also included a suggestion to improve code quality.
| # bfloat16 has precision issues on SM 120+ devices (Blackwell). | ||
| if current_platform.has_device_capability(120): |
There was a problem hiding this comment.
There seems to be a misunderstanding of NVIDIA's compute capability versions. Blackwell architecture (e.g., RTX 5090) has a compute capability of 10.0, which translates to an integer value of 100 for has_device_capability. The current code uses 120, which corresponds to a future SM 12.0, and will not activate this logic on Blackwell devices. The comment is also misleading.
Other parts of the vLLM codebase already use 100 to refer to Blackwell's compute capability.
Please correct the value to 100 and update the comment for accuracy. Using a constant for this magic number would also improve readability and maintainability.
| # bfloat16 has precision issues on SM 120+ devices (Blackwell). | |
| if current_platform.has_device_capability(120): | |
| # bfloat16 has precision issues on SM 10.0+ devices (Blackwell). | |
| if current_platform.has_device_capability(100): |
There was a problem hiding this comment.
Good catch! Fixed in 318e2d3b4. Changed to has_device_capability(100) for Blackwell SM 10.0 and updated the comment/warning message accordingly.
| import torch | ||
|
|
There was a problem hiding this comment.
Fixed in 318e2d3b4. Removed the redundant local import since torch is already imported at the top of the file.
318e2d3 to
572e029
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
572e029 to
c582994
Compare
c582994 to
7a6a37e
Compare
|
Testing performed: Tested with GGUF models on RTX 5090 (32GB, Blackwell architecture):
Ran models through a local benchmark runner for HumanEval and GSM8K. Verified code coverage by instrumenting the dtype selection path to confirm float16 is correctly applied for GGUF models. Related PRs: This is part of a series of GGUF pipeline fixes for Blackwell GPU compatibility:
|
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
Comment @cursor review or bugbot run to trigger another review on this PR
| preferred, | ||
| ) | ||
| model_config.dtype = preferred | ||
| break |
There was a problem hiding this comment.
Missing fallback when no preferred dtype matches compatible list
Low Severity
If compatible_dtypes is non-empty but contains no dtypes from dtype_preference (float16, bfloat16, float32), the for loop completes without break, leaving model_config.dtype unchanged at its original unsupported value. The code then continues to quant_config.maybe_update_config() and returns without raising an error, silently using an invalid dtype. For GGUF this cannot occur since all its supported dtypes are in the preference list, but it's a latent risk for future quantization methods.
There was a problem hiding this comment.
Acknowledged - technically valid but cannot occur in practice since all quantization methods use float16/bfloat16/float32. Happy to add a defensive else: raise ValueError(...) in a follow-up if desired.
| f"{model_config.dtype} is not supported for quantization " | ||
| f"method {model_config.quantization}. Supported dtypes: " | ||
| f"{supported_dtypes}" | ||
| ) |
There was a problem hiding this comment.
LoRA dtype set before model dtype auto-correction
Medium Severity
In VllmConfig.__post_init__, lora_config.verify_with_model_config() is called before _get_quantization_config(). When lora_dtype is "auto", it copies model_config.dtype (e.g., bfloat16). Then _get_quantization_config() may change model_config.dtype to float32. This leaves lora_config.lora_dtype with the stale bfloat16 value, creating a dtype mismatch between LoRA weights and the model when using LoRA with GGUF on Blackwell.
There was a problem hiding this comment.
Valid concern - this is a pre-existing initialization ordering issue in __post_init__, not introduced by this PR. Worth fixing separately but out of scope here (unless a human reviewer asks for it in which case, sure I'll do it).
7bdf3dd to
28d9d6f
Compare
28d9d6f to
ee1ef34
Compare
Fixes Gemma3 GGUF models failing on Blackwell GPUs with --dtype auto. Problem: - Gemma3 blocks float16 (numerical instability) - GGUF on Blackwell blocks bfloat16 (precision issues) - Only float32 works, but dtype=auto picks bfloat16 → fails Changes: 1. gguf.py: Block bfloat16 on SM 120+ (Blackwell) devices 2. vllm.py: Auto-select compatible dtype when model and quantization restrictions conflict, instead of failing with an error This allows --dtype auto to work correctly with Gemma3 GGUF on Blackwell by automatically falling back to float32. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: Christina <truffle@gmail.com>
ee1ef34 to
4064a25
Compare
Validation Results
Tested on RTX 5090 (Blackwell, SM 120) with all listed PRs cherry-picked together; models listed under each benchmark passed that benchmark in the given environment, while the same models crash or fail without these PRs applied. Rebased to current upstream HEAD and re-validated on RTX 5090 (Blackwell, SM 120). Fix confirmed still necessary — 3 GGUF models crash without it. |
…uf-dtype # Conflicts: # vllm/config/vllm.py
|
Hi @mgoin @Isotr0py — this has been sitting without review for a while. Just rebased on latest Would appreciate a look when you have cycles. Thanks! |
Summary
Adds
_resolve_dtype_conflict()to automatically select float32 when both bfloat16 (GGUF/Blackwell) and float16 (Gemma2/Gemma3) are disallowed.Changes
Root Cause
Gemma2/Gemma3 models disallow float16 (numerical instability), while GGUF on Blackwell disallows bfloat16 (precision issues). Only float32 works for both.
Testing
Tested with Gemma2 GGUF on RTX 5090 - model loads and runs correctly with auto-selected float32.