
fix(gguf): Auto-select compatible dtype for GGUF models on Blackwell #30410

Open

kitaekatt wants to merge 2 commits into vllm-project:main from kitaekatt:fix/30365-gemma3-gguf-dtype

Conversation

@kitaekatt
Contributor

Summary

Adds _resolve_dtype_conflict() to automatically select float32 when both bfloat16 (GGUF/Blackwell) and float16 (Gemma2/Gemma3) are disallowed.

Changes

  • Add dtype conflict resolution logic in model config
  • Auto-select float32 as compatible dtype when no other option works
  • Log informative message about automatic dtype selection

Root Cause

Gemma2/Gemma3 models disallow float16 (numerical instability), while GGUF on Blackwell disallows bfloat16 (precision issues). Only float32 works for both.
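
For illustration, a minimal sketch of the resolution logic (the function name follows the summary above; the actual signature and preference order in `vllm/config` may differ):

```python
import torch

def _resolve_dtype_conflict(
    model_disallowed: set[torch.dtype],
    quant_disallowed: set[torch.dtype],
) -> torch.dtype:
    """Pick the first preferred dtype allowed by both the model and
    the quantization method (sketch; preference order is assumed)."""
    preference = [torch.float16, torch.bfloat16, torch.float32]
    disallowed = model_disallowed | quant_disallowed
    for dtype in preference:
        if dtype not in disallowed:
            return dtype
    raise ValueError("No dtype is compatible with both the model "
                     "and the quantization method.")

# Gemma2/Gemma3 disallow float16, GGUF on Blackwell disallows bfloat16,
# so the only option left is float32:
assert _resolve_dtype_conflict(
    {torch.float16}, {torch.bfloat16}) == torch.float32
```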

Testing

Tested with Gemma2 GGUF on RTX 5090; the model loads and runs correctly with auto-selected float32.
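
For reference, a minimal reproduction along these lines (the GGUF path and tokenizer name are illustrative placeholders, not taken from the PR):

```python
from vllm import LLM, SamplingParams

# Hypothetical local checkpoint; any Gemma2/Gemma3 GGUF file applies.
llm = LLM(
    model="/models/gemma-2-2b-it-Q4_K_M.gguf",
    tokenizer="google/gemma-2-2b-it",
    dtype="auto",  # previously raised on Blackwell; now resolves to float32
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=8)))
```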

@gemini-code-assist Bot left a comment


Code Review

This pull request introduces a mechanism to automatically resolve data type conflicts for GGUF models, particularly on newer hardware like NVIDIA's Blackwell GPUs. The changes correctly identify scenarios where bfloat16 is disallowed by the hardware/quantization format and float16 is disallowed by the model, and fall back to float32. My review found a critical issue with the compute capability check for Blackwell GPUs, which would prevent this feature from working as intended. I've also included a suggestion to improve code quality.

Comment on lines +57 to +58
# bfloat16 has precision issues on SM 120+ devices (Blackwell).
if current_platform.has_device_capability(120):
Contributor


critical

There seems to be a misunderstanding of NVIDIA's compute capability versions. Blackwell architecture (e.g., RTX 5090) has a compute capability of 10.0, which translates to an integer value of 100 for has_device_capability. The current code uses 120, which corresponds to a future SM 12.0, and will not activate this logic on Blackwell devices. The comment is also misleading.

Other parts of the vLLM codebase already use 100 to refer to Blackwell's compute capability.

Please correct the value to 100 and update the comment for accuracy. Using a constant for this magic number would also improve readability and maintainability.

Suggested change
# bfloat16 has precision issues on SM 120+ devices (Blackwell).
if current_platform.has_device_capability(120):
# bfloat16 has precision issues on SM 10.0+ devices (Blackwell).
if current_platform.has_device_capability(100):
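
For context, vLLM represents a device capability as a (major, minor) pair and compares it as an integer, roughly major * 10 + minor. A simplified sketch of that encoding (not the actual platform code):

```python
from dataclasses import dataclass

@dataclass
class DeviceCapability:
    major: int
    minor: int

    def to_int(self) -> int:
        # e.g. SM 9.0 -> 90, SM 10.0 -> 100, SM 12.0 -> 120
        return self.major * 10 + self.minor

def has_device_capability(current: DeviceCapability, threshold: int) -> bool:
    """True if the current device meets or exceeds the threshold."""
    return current.to_int() >= threshold

assert has_device_capability(DeviceCapability(10, 0), 100)     # SM 10.0
assert not has_device_capability(DeviceCapability(9, 0), 100)  # SM 9.0
```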

Contributor Author


Good catch! Fixed in 318e2d3b4. Changed to has_device_capability(100) for Blackwell SM 10.0 and updated the comment/warning message accordingly.

Comment thread vllm/config/vllm.py
Comment on lines +405 to +406
import torch

Contributor


high

The torch library is already imported at the top of this file (line 19). This local import is redundant and should be removed to improve code clarity and adhere to standard Python style practices.

Contributor Author


Fixed in 318e2d3b4. Removed the redundant local import since torch is already imported at the top of the file.

@mergify
Contributor

mergify Bot commented Dec 15, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kitaekatt.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify Bot added the needs-rebase label on Dec 15, 2025
@kitaekatt force-pushed the fix/30365-gemma3-gguf-dtype branch from 572e029 to c582994 on December 29, 2025 at 20:41
@mergify Bot removed the needs-rebase label on Dec 29, 2025
@kitaekatt force-pushed the fix/30365-gemma3-gguf-dtype branch from c582994 to 7a6a37e on January 19, 2026 at 17:27
@kitaekatt
Contributor Author

Testing performed:

Tested with GGUF models on RTX 5090 (32GB, Blackwell architecture):

  • bartowski/NousResearch_Hermes-4-14B-GGUF
  • tensorblock/Qwen_Qwen3-30B-A3B-Instruct-2507-GGUF

Ran both models through a local benchmark runner for HumanEval and GSM8K. Verified coverage by instrumenting the dtype selection path, confirming float16 is correctly applied for these GGUF models.

Related PRs:

This is part of a series of GGUF pipeline fixes for Blackwell GPU compatibility:

@cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


Comment thread vllm/config/vllm.py
preferred,
)
model_config.dtype = preferred
break


Missing fallback when no preferred dtype matches compatible list

Low Severity

If compatible_dtypes is non-empty but contains no dtypes from dtype_preference (float16, bfloat16, float32), the for loop completes without break, leaving model_config.dtype unchanged at its original unsupported value. The code then continues to quant_config.maybe_update_config() and returns without raising an error, silently using an invalid dtype. For GGUF this cannot occur since all its supported dtypes are in the preference list, but it's a latent risk for future quantization methods.


Contributor Author


Acknowledged: technically valid, but it cannot occur in practice since all current quantization methods use float16/bfloat16/float32. Happy to add a defensive else: raise ValueError(...) in a follow-up if desired, as sketched below.
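
A sketch of such a guard (names assumed from the bot's description; not the PR's actual code):

```python
import torch

def pick_dtype(compatible_dtypes: list[torch.dtype]) -> torch.dtype:
    preference = [torch.float16, torch.bfloat16, torch.float32]
    for preferred in preference:
        if preferred in compatible_dtypes:
            return preferred
    # Fell through without a match: fail loudly rather than silently
    # keeping an unsupported dtype.
    raise ValueError(
        f"No compatible dtype among {preference}; got {compatible_dtypes}"
    )
```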

Comment thread vllm/config/vllm.py
f"{model_config.dtype} is not supported for quantization "
f"method {model_config.quantization}. Supported dtypes: "
f"{supported_dtypes}"
)


LoRA dtype set before model dtype auto-correction

Medium Severity

In VllmConfig.__post_init__, lora_config.verify_with_model_config() is called before _get_quantization_config(). When lora_dtype is "auto", it copies model_config.dtype (e.g., bfloat16). Then _get_quantization_config() may change model_config.dtype to float32. This leaves lora_config.lora_dtype with the stale bfloat16 value, creating a dtype mismatch between LoRA weights and the model when using LoRA with GGUF on Blackwell.


Contributor Author


Valid concern, but this is a pre-existing initialization-ordering issue in __post_init__, not one introduced by this PR. Worth fixing separately, but out of scope here (unless a human reviewer asks for it, in which case I'll do it). One possible shape for that follow-up is sketched below.
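
A hypothetical, untested sketch of the follow-up, with attribute names taken from the bot's description:

```python
def resync_lora_dtype(vllm_config) -> None:
    """Re-resolve an 'auto' LoRA dtype after model dtype auto-correction,
    so LoRA weights match a model dtype that may have changed to float32."""
    lora = vllm_config.lora_config
    if lora is not None and lora.lora_dtype in ("auto", None):
        lora.lora_dtype = vllm_config.model_config.dtype
```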

@kitaekatt force-pushed the fix/30365-gemma3-gguf-dtype branch from 7bdf3dd to 28d9d6f on January 22, 2026 at 21:42
@kitaekatt force-pushed the fix/30365-gemma3-gguf-dtype branch from 28d9d6f to ee1ef34 on February 5, 2026 at 00:18
Fixes Gemma3 GGUF models failing on Blackwell GPUs with --dtype auto.

Problem:
- Gemma3 blocks float16 (numerical instability)
- GGUF on Blackwell blocks bfloat16 (precision issues)
- Only float32 works, but dtype=auto picks bfloat16 → fails

Changes:
1. gguf.py: Block bfloat16 on SM 120+ (Blackwell) devices
2. vllm.py: Auto-select compatible dtype when model and quantization
   restrictions conflict, instead of failing with an error

This allows --dtype auto to work correctly with Gemma3 GGUF on Blackwell
by automatically falling back to float32.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Christina <truffle@gmail.com>
@kitaekatt
Contributor Author

Validation Results

| vLLM | transformers | Cherry-picked PRs | HumanEval | IFEval |
|------|--------------|-------------------|-----------|--------|
| HEAD | 5.x | #30410, #30411, #30412, #30413, #30424, #30434, #30699, #30702, #31464, #33846 | gem2-2b-gguf (42.1%), gemma3-1b (26.8%) | gem2-2b-gguf (65.6%) |
| HEAD | 4.x | #30410, #30411, #30412, #30413, #30424, #30434, #30699, #30702, #31464, #33846 | q3-moe-gguf (83.5%) | q3-moe-gguf (85.4%) |

Tested on RTX 5090 (Blackwell, SM 120) with all listed PRs cherry-picked together; models listed under each benchmark passed that benchmark in the given environment, while the same models crash or fail without these PRs applied.

Rebased to current upstream HEAD and re-validated on RTX 5090 (Blackwell, SM 120). Fix confirmed still necessary — 3 GGUF models crash without it.

@kitaekatt
Contributor Author

Hi @mgoin @Isotr0py — this has been sitting without review for a while. Just rebased on the latest upstream/main (merge commit), so the branch should now be mergeable. Quick summary: this auto-selects a compatible dtype for GGUF models on Blackwell. When GGUF's bfloat16 restriction conflicts with a model's float16 restriction (e.g. Gemma3 GGUF on Blackwell), it falls back to float32 automatically instead of erroring.

Would appreciate a look when you have cycles. Thanks!

